[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881830#comment-13881830 ]
Hudson commented on HDFS-4949:
FAILURE: Integrated in Hadoop-Yarn-trunk #461 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/461/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

Centralized cache management in HDFS
Key: HDFS-4949
URL: https://issues.apache.org/jira/browse/HDFS-4949
Project: Hadoop HDFS
Issue Type: New Feature
Components: datanode, namenode
Affects Versions: 3.0.0, 2.4.0
Reporter: Andrew Wang
Assignee: Andrew Wang
Fix For: 2.4.0
Attachments: HDFS-4949-consolidated.patch, caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf, caching-design-doc-2013-10-24.pdf, caching-testplan.pdf, hdfs-4949-branch-2.patch

HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881857#comment-13881857 ]
Hudson commented on HDFS-4949:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1678 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1678/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881875#comment-13881875 ]
Hudson commented on HDFS-4949:
SUCCESS: Integrated in Hadoop-Hdfs-trunk #1653 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1653/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879143#comment-13879143 ]
Colin Patrick McCabe commented on HDFS-4949:
branch-2 patch looks good to me; thanks Andrew
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879198#comment-13879198 ]
Andrew Wang commented on HDFS-4949:
Thanks Colin. The branch-2 test run I did with this also came back clean, so I think it's good to go. Will commit shortly.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879235#comment-13879235 ]
Andrew Wang commented on HDFS-4949:
I've committed this to branch-2 and fixed up the CHANGES.txt in branch-2 and trunk accordingly. It might finally be time to resolve this parent issue, and punt all remaining subtasks out into their own standalone issues.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879301#comment-13879301 ]
Hudson commented on HDFS-4949:
SUCCESS: Integrated in Hadoop-trunk-Commit #5035 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5035/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877995#comment-13877995 ]
Hadoop QA commented on HDFS-4949:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12624211/hdfs-4949-branch-2.patch
against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5930//console
This message is automatically generated.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855882#comment-13855882 ]
Arun C Murthy commented on HDFS-4949:
In YARN-1488, we've been discussing a proposal in which we can leverage YARN's resource-management *and* workload-management capabilities (via delegation of resources, in this case RAM, to HDFS) to provide more general cache administration.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838467#comment-13838467 ]
Fengdong Yu commented on HDFS-4949:
[~cnauroth], yes, I can find CacheManager in trunk, but it's not consistent with the HDFS-4949 branch. Are changes still being committed to the HDFS-4949 branch even after the merge to trunk?
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838471#comment-13838471 ]
Andrew Wang commented on HDFS-4949:
Hey [~azuryy], since the merge, we've been committing caching-related patches just to trunk. The branch is now defunct. We're still using this JIRA for tracking ongoing subtasks though, and some JIRAs also have the caching label.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838481#comment-13838481 ]
Fengdong Yu commented on HDFS-4949:
Thanks, Andrew.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836687#comment-13836687 ]
Chris Nauroth commented on HDFS-4949:
Hi, [~azuryy]. The merge vote passed and HDFS-4949 was merged to trunk about a month ago. For example, here you can see the {{CacheManager}} class on trunk:
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/CacheManager.java
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836321#comment-13836321 ]
Fengdong Yu commented on HDFS-4949:
The vote thread was started a month ago, but the HDFS-4949 branch still has not been merged to trunk. Are there any blockers here?
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807867#comment-13807867 ]
Hudson commented on HDFS-4949:
SUCCESS: Integrated in Hadoop-Yarn-trunk #377 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/377/])
Merge HDFS-4949 branch back into trunk (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1536572)
* /hadoop/common/trunk
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BatchedRemoteIterator.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ByteBufferUtil.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/HasEnhancedByteBufferAccess.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ReadOption.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ZeroCopyUnavailableException.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/permission/FsPermission.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ByteBufferPool.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ElasticByteBufferPool.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IdentityHashStore.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IntrusiveCollection.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightCache.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightGSet.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringUtils.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/nativeio/TestNativeIO.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestIdentityHashStore.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestLightWeightGSet.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmap.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmapManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/AddPathBasedCacheDirectiveException.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/CachePoolInfo.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/PathBasedCacheDescriptor.java
*
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807887#comment-13807887 ]
Hudson commented on HDFS-4949:
FAILURE: Integrated in Hadoop-Hdfs-trunk #1567 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1567/])
Merge HDFS-4949 branch back into trunk (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1536572)
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807952#comment-13807952 ]
Hudson commented on HDFS-4949:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1593 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1593/])
Merge HDFS-4949 branch back into trunk (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1536572)
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807135#comment-13807135 ] Chris Nauroth commented on HDFS-4949: - +1 for the merge. Thanks again, Andrew and Colin. Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.3.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf, caching-design-doc-2013-10-24.pdf, caching-testplan.pdf, HDFS-4949-consolidated.patch HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807547#comment-13807547 ] Hudson commented on HDFS-4949: -- SUCCESS: Integrated in Hadoop-trunk-Commit #4664 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4664/]) Merge HDFS-4949 branch back into trunk (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1536572) * /hadoop/common/trunk * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BatchedRemoteIterator.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ByteBufferUtil.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/HasEnhancedByteBufferAccess.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ReadOption.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ZeroCopyUnavailableException.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/permission/FsPermission.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ByteBufferPool.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ElasticByteBufferPool.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java * 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IdentityHashStore.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IntrusiveCollection.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightCache.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightGSet.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringUtils.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/nativeio/TestNativeIO.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestIdentityHashStore.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestLightWeightGSet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmap.java * 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmapManager.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/AddPathBasedCacheDirectiveException.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/CachePoolInfo.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/PathBasedCacheDescriptor.java *
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805435#comment-13805435 ] Suresh Srinivas commented on HDFS-4949:
Given that the merge vote thread has been started, can someone post details about what functionality from the original design has been completed and what is pending? It looks like the current functionality does not cover quota management. Are any other features pending?
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805463#comment-13805463 ] Chris Nauroth commented on HDFS-4949:
Here is a list of items discussed in the design doc to be completed later, with corresponding jira if it exists:
* quota enforcement - need to file jira?
* cache expiry based on TTL - need to file jira?
* incremental cache reports - HDFS-5092
* metrics - HDFS-5320
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805791#comment-13805791 ] Colin Patrick McCabe commented on HDFS-4949:
Quick note: TestOfflineEditsViewer failed in Jenkins because the consolidated patch didn't change the binary edit log file used by that test. It succeeds on the branch.
bq. incremental cache reports - HDFS-5092
This is listed as a maybe in the design doc; it's something that we want to evaluate before doing. Quotas, metrics, and TTL are definitely post-merge, I think.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804816#comment-13804816 ] Hadoop QA commented on HDFS-4949:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12610186/caching-design-doc-2013-10-24.pdf against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5272//console
This message is automatically generated.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804828#comment-13804828 ] Chris Nauroth commented on HDFS-4949:
Thank you, Colin and Stephen. The design doc and test plan LGTM.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804840#comment-13804840 ] Hadoop QA commented on HDFS-4949:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12610166/HDFS-4949-consolidated.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 29 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1551 javac compiler warnings (more than the trunk's current 1548 warnings).
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 5 warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.
{color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.web.TestJsonUtil org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5270//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5270//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5270//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5270//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5270//console
This message is automatically generated.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805021#comment-13805021 ] Hadoop QA commented on HDFS-4949:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12610221/HDFS-4949-consolidated.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 30 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1551 javac compiler warnings (more than the trunk's current 1548 warnings).
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5273//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5273//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5273//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5273//console
This message is automatically generated.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740145#comment-13740145 ] Andrew Wang commented on HDFS-4949:
Hi Arun,
On the read path comments, it might be helpful to check out the zero-copy read API that Colin's working on at HDFS-4953. The idea is that clients always use the zero-copy cursor to do reads, which behind the scenes will do an mmap'd read if the block is cached, or a normal copying read if the block is on disk or remote. Leaving the fallback buffer for copying reads unset gives an {{isCached}}-type check: the cursor will then throw an exception on read if the block is not cached. Finally, there's also a parameter for enabling short reads, which comes into play when a read spans block files.
On YARN integration, I'd like to revisit that a little ways down the road since we're focusing on getting a basic prototype out. If you want to get started on it now, it'd be helpful if you could review the current RM plan in the doc, and sketch out how a YARN-based architecture would look.
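The read path described in this comment can be sketched in plain Java. This is only an illustration under stated assumptions, not the HDFS-4953 API: the class {{ZeroCopyCursorSketch}}, its fields, and its method names are hypothetical, a heap {{ByteBuffer}} stands in for an mmap'd cached block, and a byte array stands in for a block that is only reachable through a copying read.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for the zero-copy cursor idea; not the HDFS-4953 API. */
final class ZeroCopyCursorSketch {
    private final ByteBuffer cachedBlock; // non-null models a cached (mmap'd) block
    private final byte[] onDiskBlock;     // models a block that needs a copying read
    private ByteBuffer fallbackBuffer;    // optional target for copying reads

    ZeroCopyCursorSketch(ByteBuffer cachedBlock, byte[] onDiskBlock) {
        this.cachedBlock = cachedBlock;
        this.onDiskBlock = onDiskBlock;
    }

    /** Leaving this unset turns read() into an isCached-type check. */
    void setFallbackBuffer(ByteBuffer buffer) {
        this.fallbackBuffer = buffer;
    }

    /** Returns a read-only view of a cached block, else copies into the fallback buffer. */
    ByteBuffer read(int length) {
        if (cachedBlock != null) {
            // Cached case: hand out a view over the same bytes, no copy.
            ByteBuffer view = cachedBlock.asReadOnlyBuffer();
            view.position(0);
            view.limit(Math.min(length, view.capacity()));
            return view;
        }
        if (fallbackBuffer == null) {
            // Uncached and no fallback: fail instead of silently copying.
            throw new IllegalStateException("block is not cached and no fallback buffer is set");
        }
        // Uncached case: ordinary copying read into the caller-supplied buffer.
        fallbackBuffer.clear();
        int n = Math.min(length, Math.min(onDiskBlock.length, fallbackBuffer.capacity()));
        fallbackBuffer.put(onDiskBlock, 0, n);
        fallbackBuffer.flip();
        return fallbackBuffer;
    }
}
```

A caller that only wants cached data would skip {{setFallbackBuffer}} and treat the exception as "not cached"; a caller that wants the data either way sets a fallback buffer once and reads as usual.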
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736019#comment-13736019 ] Arun C Murthy commented on HDFS-4949:
bq. 1. The main reason we added auto-caching of new files was actually for Hive. My understanding is that Hive users can drop new files into a Hive partition directory without notifying the Hive metastore, e.g. via the fs shell.
Usually partitions in Hive are new directories. So every 5 or 10 or 15 mins a new directory is added along with new data. Hence, the ability to automatically cache new files seems redundant.
bq. 2. We were planning on extending the existing getFileBlockLocations API (which takes a Path, offset, and length) to also indicate which replicas of the returned blocks are cached. This should satisfy the needs of framework schedulers like MR or Impala.
[~andrew.wang] Agree that the enhancement to getFileBlockLocations suffices for the scheduler. However, at read time it will be very useful to get an indicator on whether it's cached or not during open. The RecordReader needs this API to decide whether to do stream-based reads (when data isn't cached in RAM) or mmap the file (when it's cached). It would be unfortunate to have to do another call to getFileBlockLocations to validate during read time. For example, SequenceFileRecordReader.initialize would look something like:
{code:title=SequenceFileRecordReader.java}
public void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException {
  // ...
  FileSplit fileSplit = (FileSplit) split;
  // Proposed (not existing) open(offset, length) that returns either a
  // cached buffer or a regular stream for the split.
  StreamOrCached splitData =
      fileSplit.getPath().open(fileSplit.getStart(), fileSplit.getLength());
  InputStream in = null;
  if (splitData.isCached()) {
    // Data is cached in RAM: read it through the mapped buffer.
    in = new ByteBufferInputStream(splitData.getByteBuffer());
  } else {
    // Data is not cached: fall back to a stream-based read.
    in = splitData.getFSDataInputStream();
  }
  // Now use in
  // ...
}
{code}
So, having an open API which returns something like StreamOrCached will be useful, as sketched above. Open to other ideas, but hopefully I put across what I'm looking for. Thoughts?
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736020#comment-13736020 ] Arun C Murthy commented on HDFS-4949:
bq. Tying in YARN would definitely be great. There's half a hope that we can jump right from a prototype naive scheme to using YARN directly ...
I'm happy to help get that done; let's discuss more. Agree that having the right abstractions is important.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734926#comment-13734926 ] Andrew Wang commented on HDFS-4949:
Hi Tsuyoshi, HDFS-4953 allows applications to do zero-copy reads, so when combined with this JIRA, HDFS will be able to provide full memory-bandwidth reads on cached data. Deserialization is a somewhat separate concern since it happens at the application-level though. If an app can operate directly on the raw bytes in a file (e.g. a ByteBuffer), then it can avoid deserialization overhead. IIUC, this is untrue of the current MR input formats.
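As a rough illustration of what "operating directly on the raw bytes" could mean, the sketch below sums fixed-width integers straight out of a {{ByteBuffer}} without building any per-record objects. Everything here is an assumption made for the example: the class name {{RawBytesSumSketch}} is hypothetical, and the layout (consecutive 4-byte big-endian ints) is invented, not a format used by any MR input format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical example: consume records in place instead of deserializing them. */
final class RawBytesSumSketch {
    /** Sums consecutive 4-byte big-endian ints without allocating per-record objects. */
    static long sumInts(ByteBuffer data) {
        // Work on a duplicate so the caller's position and limit are left untouched.
        ByteBuffer view = data.duplicate();
        view.order(ByteOrder.BIG_ENDIAN);
        long sum = 0;
        while (view.remaining() >= Integer.BYTES) {
            sum += view.getInt();
        }
        return sum;
    }
}
```

The same loop would work unchanged on a memory-mapped buffer, which is where avoiding the intermediate copy and the object creation would matter most.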
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735382#comment-13735382 ] Suresh Srinivas commented on HDFS-4949:
bq. As a meta-point, I think much of the remaining resource management design can wait until after we get the initial end-to-end implementation going.
+1 for this. There are many loose ends to be tied and details to be figured out in the design. But the basic implementation could start right away. Some things that we should get to sooner rather than later:
- Pool abstraction and making sure all the APIs use it (including cache creation and deletion)
- Some details related to how the stream-oriented APIs change to buffer-oriented access
The real quota management, counting common cached data against different pools etc. can be revisited later. Will take a look at the updated doc soon. Thanks Andrew.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735632#comment-13735632 ] Arun C Murthy commented on HDFS-4949:
bq. As a meta-point, I think much of the remaining resource management design can wait until after we get the initial end-to-end implementation going.
Makes sense. I, for one, would volunteer to help you guys do resource management directly via YARN rather than go the route of inventing half of YARN RM within HDFS. It would benefit both HDFS (simpler, plus the ability to use memory dynamically between applications and for caching) and YARN (more robust for a diverse set of applications). Any takers? Thanks.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735643#comment-13735643 ] Arun C Murthy commented on HDFS-4949:
[~andrew.wang] overall it looks great, some more questions:
# I'm not sure you want to automatically add new files in a directory to the cache; a higher-level system (Hive, Impala, HCat) seems to be in a better position to do that. Not doing this automatically simplifies cache mgmt, quota mgmt etc.
# Can you please provide details on the read APIs? For the Hive/MR/Pig use case I'd like to see a new open(Path, offset, length) which returns an indicator for whether the block is cached or not. This, for example, would be used by the RecordReader to read the split.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735723#comment-13735723 ] Andrew Wang commented on HDFS-4949: --- Hey Arun, thanks for taking a look! Tying in YARN would definitely be great. There's half a hope that we can jump right from a prototype naive scheme to using YARN directly, but our resource management team doesn't have time in the near term to make this happen. I definitely want our abstractions to be as similar as possible though to ease a future transition; your input there is appreciated. As to your other points: 1. The main reason we added auto-caching of new files was actually for Hive. My understanding is that Hive users can drop new files into a Hive partition directory without notifying the Hive metastore, e.g. via the fs shell. Since we'd like to provide the abstraction of caching higher-level abstractions like Hive partitions or tables, this auto-caching is necessary. 2. We were planning on extending the existing getFileBlockLocations API (which takes a Path, offset, and length) to also indicate which replicas of the returned blocks are cached. This should satisfy the needs of framework schedulers like MR or Impala. At read time, we'll also provide per-stream statistics of the number of bytes read remotely vs. local disk vs. local memory. Remote memory reads are also on our mind, but will likely be a per-stream or per-client config option added later. Suresh, to partially address your questions, Colin's going to put pools into the patch at HDFS-5052, and he's also been working on buffer-oriented access at HDFS-4953. Thanks for your comments on the subtasks thus far. 
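As an illustration of how a framework scheduler might use block locations that also say which replicas are cached, here is a minimal sketch in Python. It is a model only and not the HDFS API: the `BlockLocation` class, its `cached_hosts` field, and `pick_host` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BlockLocation:
    """Hypothetical stand-in for one entry returned by a block-location call."""
    offset: int
    length: int
    hosts: List[str]  # datanodes holding a disk replica
    cached_hosts: List[str] = field(default_factory=list)  # subset holding the block in memory


def pick_host(loc: BlockLocation, free_hosts: List[str]) -> Optional[str]:
    """Prefer a free host with a cached replica, then a free host with a disk replica."""
    for candidates in (loc.cached_hosts, loc.hosts):
        for host in candidates:
            if host in free_hosts:
                return host
    return None  # no local placement possible; the task would read remotely
```

With a block stored on `dn1`, `dn2`, and `dn3` and cached only on `dn2`, the sketch places the task on `dn2` whenever that host has a free slot and otherwise falls back to a disk-local host.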
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731077#comment-13731077 ] Colin Patrick McCabe commented on HDFS-4949: I created a branch for this (HDFS-4949) Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.3.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731143#comment-13731143 ] Suresh Srinivas commented on HDFS-4949: ---

My notes from the meeting:

Enabling this feature on windows platform requires the following:
# Need Unix Domain sockets equivalent
# mmap and munmap are done using java and should not require any windows specific changes
# mlock there is no windows equivalent?

Quota for datanode cache is counted against the pool.

Design needs to cover the following scenarios in more detail:
# Two pools caching the same file and how quota is counted
# Resource failures and how they affect existing caches for the pools. Perhaps pools should have priorities.
#* scenario 1 - resource failure takes down cached data. In the first cut, no new cached replicas will be created.
#* scenario 2 - resources failed and cluster capacity is low, then the application, even if higher priority, will not get cache quota.
# Caching supported for whole file for now.
# Only completed blocks will be cached. This is true for files that are being written.
# symlink paths will not be cached
# Need to add more details on enabling cache for a directory and how the newly created files (on completion of write) will be added to the cache. This also has quota implications and a need for handling failures related to either reaching quota or non-availability of resources for such automatic caching to work.

We should add a TTL for caching requests and expire the cache.

I think we should refresh the design document based on the discussions.
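The open questions above (how quota is counted when two pools cache the same file, and TTL-based expiry) can be made concrete with a small model. The sketch below is in Python and assumes one possible policy, namely that every pool is charged in full for what it requests, even if another pool caches the same path. That policy, and the names `CacheRequest` and `PoolQuotas`, are assumptions for illustration and not a decision recorded in these notes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CacheRequest:
    path: str
    size_bytes: int
    pool: str
    expires_at: float  # absolute time in seconds; models the proposed TTL


class PoolQuotas:
    """Per-pool byte quotas; each pool is charged in full (assumed policy)."""

    def __init__(self, limits: Dict[str, int]):
        self.limits = dict(limits)
        self.requests: List[CacheRequest] = []

    def expire(self, now: float) -> None:
        # Drop requests whose TTL has elapsed so their quota is released.
        self.requests = [r for r in self.requests if r.expires_at > now]

    def used(self, pool: str, now: float) -> int:
        return sum(r.size_bytes for r in self.requests
                   if r.pool == pool and r.expires_at > now)

    def add(self, req: CacheRequest, now: float) -> bool:
        # Quota is checked up front, before any data is actually cached.
        self.expire(now)
        if self.used(req.pool, now) + req.size_bytes > self.limits.get(req.pool, 0):
            return False
        self.requests.append(req)
        return True
```

Under this policy two pools can each cache the same 80-byte file against their own 100-byte quota, and a further request in either pool is rejected until the earlier one expires.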
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731158#comment-13731158 ] Chris Nauroth commented on HDFS-4949: - bq. mlock there is no windows equivalent? I believe the Windows equivalent of {{mlock}} and {{munlock}} are {{VirtualLock}} and {{VirtualUnlock}}. http://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx http://msdn.microsoft.com/en-us/library/windows/desktop/aa366910(v=vs.85).aspx Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.3.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731632#comment-13731632 ] Tsuyoshi OZAWA commented on HDFS-4949: --

Hi, Is there any plan to add APIs to access this caching layer directly from processing frameworks, e.g. MR? The RDD (Spark) paper says that serialization/deserialization of file contents can be the bottleneck of processing. If we have such general APIs using mlock/munlock, it can reduce processing time drastically. Or is this idea out of scope for this JIRA?
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13725704#comment-13725704 ] David S. Wang commented on HDFS-4949: -

HDFS-4949 meeting, July 29, 2013 2 PM @ Hortonworks office

Attendees
Aaron T. Myers, Andrew Wang, Arpit Gupta, Bikas Saha, Brandon Li, Colin McCabe, Dave Wang, Jing Zhao, Suresh Srinivas, Sanjay Radia, Todd Lipcon, Vinod Kumar Vavilapalli

Minutes
* General agreement to hold HDFS-2832 meeting some other day
* Andrew: Posted HDFS-4949 design doc upstream; Sanjay has read this, agrees with the goals

Data path (zero-copy reads)
* Sanjay: quota mgmt - counted up front, not after cache is populated
* Colin: talking about ZCR (mmap) - used to implement caching at the DN level
** Considered copying everything into /dev/shm (e.g. Tachyon). But cannot cache parts of a file, so limits our flexibility. Also, the associated fd gives clients a way to control memory mgmt (will not release until that descriptor is closed), which is not good because of buggy clients etc.
** Sanjay: you want an abstraction for a durable file. Colin: yes.
** Colin: ZCR currently doesn't have checksums, but will. Todd: assumption is that DN will do the cksum when doing the mlock and communicate that to the client so the client knows that it's safe to read.
** Todd: mincore() can tell you what's already in cache, but it's too granular, very expensive to call, and can be out-of-date immediately.
** Assuming this is for local clients only obviously.
** ZCR uses ByteBuffers to avoid copies. Not entirely compatible with current DFSClient since that uses byte arrays, so you cannot avoid copies.
* This may have a conflict with inline checksums. Clients would have to be aware of how to skip over checksums, and this would have to be in the app, not the client since we're talking mmap.
** HBase gets around this by disabling HDFS-level cksums, and doing it on their level.
** Sanjay: QFS puts all of the cksums in the beginning of the file
** Todd: Liang Xie had an HBase study where he figured out that perf didn't improve until he got to a TB of data, when the cksum files themselves dropped out of cache.
* ZCR API can be made public? Colin, Todd: Yes.
** Hard to compete with Spark if this isn't public.
** Suresh: Will the app know if you got ZCR? Can be added as counters. Colin: already have similar concepts today for SCR on a per-stream basis.
** Todd: SCR is fully transparent (uses today's API), while ZCR requires new client API.
* Sanjay: Current policy is manual. Later policy can have system automatically cache hot files. Need the fallback buffer in case you are remote.
** Todd: high perf apps will always use the ZCR API. Sometimes it will fall back to a normal read, so no worse than today.
** Colin: should we have a flag that basically says always mmap? Can add it later, don't know how useful this could be.
* Colin: no native support required for ZCR beyond what is there today. There are some libhdfs changes, but not completely required. Java has mmap today.
** We will need a native call for locking though.

Centralized cache mgmt
* Andrew gave whiteboard presentation
** DN has mlock hooks, ulimit conf of how much it can cache
** NN sends cache/uncache commands on whole blocks to the DN in its heartbeat responses
** DN will send cache state to NN similar to block reports
** clients call getFileBlockLocations() with storageType arg. This returns the current state of the cache.
** clients can issue CachingRequests, with a path that points to a file or directory. If directory, then what is cached is what's in that directory (but not recurse to subdirs), in order to support Hive. Can also specify user, pool for quota mgmt. Can also specify # cache copies (must be <= replicationFactor).
* Quotas
** Quotas are on pools, not users. Quotas enforced on the NN.
** Questions about what is cached as machines come and go? Use getFileBlockLocations() to get cache request and current status of fulfillment. Can be not fulfilled due to quota for instance.
*** Should cache requests from two pools be counted fully against both? Half-half? Cluster capacity can be dynamic, so you always have potential quota mgmt problems.
** Don't want to get this so complicated so that you basically are implementing another scheduler just for cache quotas.
** Suresh: Resource failures - how does this affect the pools? Should we have priorities for pools? Priorities for individual CacheRequests?
** Andrew: suggestion of min/max/share (similar to VMware ESX VM memory configuration).
** Suresh: fine with doing something very basic, and then be more intelligent later.
** Sanjay: need to have some idea of per-pool priority to enforce min, to figure out what to evict from the cache first in mem-constrained scenarios. Also what happens once we have resources again?
* Suresh:
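The control loop described in the whiteboard notes (the DN reports what it has cached, the NN answers with cache/uncache commands on whole blocks) can be sketched as a small reconciliation step. The Python below is a simplified model under stated assumptions: `plan_cache_commands` is a hypothetical name, capacity is counted in whole blocks rather than bytes, and ties are broken by block ID only to keep the result deterministic.

```python
from typing import Set, Tuple


def plan_cache_commands(wanted: Set[int],
                        reported: Set[int],
                        capacity_blocks: int) -> Tuple[Set[int], Set[int]]:
    """Return (to_cache, to_uncache) block IDs for one datanode.

    wanted:   blocks the namenode would like this datanode to hold in memory
    reported: blocks the datanode said it has cached in its last cache report
    """
    # Anything cached but no longer wanted is released first.
    to_uncache = reported - wanted
    # Blocks that are both wanted and already cached keep their slot.
    keep = reported & wanted
    room = max(capacity_blocks - len(keep), 0)
    # Fill the remaining room with wanted blocks that are not cached yet.
    to_cache = set(sorted(wanted - reported)[:room])
    return to_cache, to_uncache
```

Wanted blocks that do not fit stay uncached in this model, which corresponds to a caching request that is only partially fulfilled.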
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723306#comment-13723306 ] Suresh Srinivas commented on HDFS-4949: ---

As discussed in the comments earlier, a few of us are going to meet to discuss the design and issues related to this jira. I have set up a meeting at the Hortonworks office. We should be able to host around 15-20 people. I already have [~andrew.wang], [~arpitgupta], [~atm], [~bikassaha], [~cmccabe], [~sanjay.radia], [~sureshms], [~vinodkv], [~jingzhao] and gopal as attendees. Others who want to attend the meeting or want to join over the phone, please reach out to me at sur...@hortonworks.com. We will post notes from the discussion to this jira.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712529#comment-13712529 ] Sanjay Radia commented on HDFS-4949:

Caching partial blocks: There is no problem with a DN caching only the hot parts of a block and still declaring to the NN that the block is cached in RAM. This would fit in with the proposal of abstracting RAM copies as replicas.

The use case that does not fit is where one datanode, DN1, has cached the first 100 bytes and another datanode, DN2, has cached the last 100 bytes, and you want the client to go to the right datanode based on what portion of the file it is reading. If and when we finally get to caching portions and we want to support that use case, we could, at that time, consider having the block-info sent for RAM replicas indicate what portions are cached -- this would mean that certain replicas have additional information in the block map.

Given that we are not caching portions of blocks for this Jira, and that for tiered storage for SSDs we want to add the device info to block location, I suggest that we proceed with abstracting RAM copies as replicas and revisit this decision for partial block caching at a later point.
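The DN1/DN2 use case can be illustrated with a short Python sketch of how a client might choose a datanode if block-info carried the cached byte ranges. This is purely hypothetical: partial-block caching is out of scope for this JIRA, and `pick_datanode` and the half-open `(start, stop)` range representation are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def pick_datanode(cached_ranges: Dict[str, List[Tuple[int, int]]],
                  offset: int,
                  length: int) -> Optional[str]:
    """Return a datanode whose cached range fully covers [offset, offset + length)."""
    end = offset + length
    for datanode, ranges in cached_ranges.items():
        # Ranges are half-open: (start, stop) covers bytes start .. stop - 1.
        if any(start <= offset and end <= stop for start, stop in ranges):
            return datanode
    return None  # no single datanode has the whole range in memory
```

A read that straddles a cached and an uncached region returns `None` here, so the client would fall back to an ordinary replica read.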
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712579#comment-13712579 ] Colin Patrick McCabe commented on HDFS-4949: As Todd, Andrew, and I said before, all of the designs we considered that treated what was in the cache as replicas suffered from an inability to revoke the client's access to this memory. If you pass the client a file descriptor to a file in {{/dev/shm}}, you cannot revoke access to that later on. The client can hold on to that memory forever. That alone is enough to throw out that design. To avoid this, we have to use mmap of a file on disk. And when you do that, it can no longer be abstracted as a replica, because the on-disk copy has to exist. It is at best, a property of an existing replica. Just as important, caching decisions also have to be made on a different timescale than decisions about hierarchical storage management. HSM decisions can be made over the course of minutes or hours; caching decisions have to be made in seconds to be relevant. Memory is not a storage tier. It doesn't store anything; rather, it caches. Does it make sense to fsck memory? That is silly. Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.2.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message is automatically generated by JIRA. 
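To show what "mmap of a file on disk" means for a reader, here is a minimal Python sketch that maps a stand-in block file read-only and reads a slice through a `memoryview` without first copying the whole file into a separate buffer. It is an illustration only: `mapped_block` and `demo` are hypothetical names, the temporary file stands in for a datanode block file, and neither the `mlock` step nor revocation by the datanode is modeled.

```python
import mmap
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def mapped_block(path: str):
    """Map a non-empty block file read-only and yield a memoryview over it."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)
    try:
        yield view
    finally:
        # Release the view before closing, otherwise close() would fail.
        view.release()
        mm.close()


def demo() -> bytes:
    """Write a small stand-in block file, map it, and read one slice."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello block data")
        os.close(fd)
        with mapped_block(path) as view:
            # Slicing the view does not copy; bytes() copies only these 5 bytes.
            return bytes(view[6:11])
    finally:
        os.remove(path)
```

Because the data lives in an ordinary file that is mapped, the on-disk copy always exists, which is the property the comment above relies on.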
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712856#comment-13712856 ] Sanjay Radia commented on HDFS-4949:

bq. we have to use mmap of a file on disk.

Please look at my comments: I have not objected to mmap and mlock. I am fine with having RAM replicas backed by a disk replica; indeed I see this as an important advantage over ramfs, where the data is copied. The replica abstraction allows for a more general view where they are not, but our implementation restricts the memory replicas to be backed by disk replicas.

bq. In general, tiered storage management happens over a longer period of time than cache management.

The term tier-storage is unfortunate (I misused it in my original comment). In HDFS-2832, we consciously used the term heterogeneous storage and not tiered storage. Tiering, as in moving things based on their hotness, is policy. (BTW I envision using SSDs initially not for moving hot blocks but as storage for *one* of 3 replicas. I have discussed this use case with a few of the HBase folks.) Caching is a use case that applies well to disks vs RAM. Both use cases apply well to the abstraction of replicas stored on different kinds of storage devices.

bq. Memory is not a storage tier. It doesn't store anything; rather, it caches. Does it make sense to fsck memory? That is silly.

Memory and disks both store data, but one is way more durable. Fsck is a bad example - you do fsck on a file system, not on the disk. Here we are talking about entities that store HDFS block data. But this debate over the similarities and differences between RAM and disk is a longer one that we should have over beer. I am not blind to the differences between disks and RAM. Further, using the same abstraction to model RAM copies and disk copies does not imply that I am always going to treat them exactly the same and ignore the differences.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713117#comment-13713117 ] Sanjay Radia commented on HDFS-4949: To converge on this could we do a meetup? Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.2.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713327#comment-13713327 ] Colin Patrick McCabe commented on HDFS-4949: A meetup is a good idea. I will be at OSCON next week on Tuesday, Wednesday, and Thursday, but any other time in the next two weeks is fine with me. I can't speak for Andrew and Todd, but I didn't see anything on the calendar that would block it in that time frame. Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.3.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709911#comment-13709911 ] Chu Tong commented on HDFS-4949:

To maximize memory usage, should we consider compressing file blocks before caching them in memory?
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710097#comment-13710097 ] Colin Patrick McCabe commented on HDFS-4949: For most of the applications we're considering here, compression would not be a win, because it is CPU-intensive. It also would involve copying the data in memory, which is one of the things we're trying to avoid here. I think it will be more effective to use something like CompressionCodec, ORC, Parquet, rcfile, etc. Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.2.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707288#comment-13707288 ] Sanjay Radia commented on HDFS-4949:

I think we can treat the RAM copies as replicas - this fits into the generalized tiered-storage architecture as described in HDFS-2832 (RAM, flash, disk).
* Block reports will indicate the storage type.
* NN will store the storage type in the block map.
* Block locations returned by NN to client will have storage type (i.e. don't need the IsCached flag).
** NN will order the replica locations based on closeness and speed; this will mean that the client side will automatically go to the best place (although we can have a smarter client do something different if desired).
* NN will not count RAM replicas towards the normal replica count - this is one area where the RAM replicas are treated differently.
* This can support a usage model where the RAM replicas are at each or only some of the disk replica locations.
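The ordering and counting rules proposed above can be written down as a short Python sketch. It is one possible reading, not the namenode's implementation: the `Replica` class, the `STORAGE_RANK` table, and the choice to sort by distance first and storage speed second are all assumptions for illustration, and the distance values (0 same node, 2 same rack, 4 off rack) are an assumed convention.

```python
from dataclasses import dataclass
from typing import List

# Assumed speed ranking, fastest first.
STORAGE_RANK = {"RAM": 0, "SSD": 1, "DISK": 2}


@dataclass(frozen=True)
class Replica:
    host: str
    storage: str   # "RAM", "SSD", or "DISK"
    distance: int  # network distance from the reader (assumed: 0, 2, or 4)


def order_replicas(replicas: List[Replica]) -> List[Replica]:
    """Sort by closeness first, then by storage speed among equally close replicas."""
    return sorted(replicas, key=lambda r: (r.distance, STORAGE_RANK[r.storage]))


def durable_replica_count(replicas: List[Replica]) -> int:
    """RAM replicas are excluded from the normal replica count."""
    return sum(1 for r in replicas if r.storage != "RAM")
```

Swapping the two elements of the sort key gives the alternative policy in which a same-rack memory replica beats a local disk replica, so the trade-off between closeness and speed reduces to a one-line choice in this model.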
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707389#comment-13707389 ] Colin Patrick McCabe commented on HDFS-4949: I agree that there are some commonalities between the hierarchical storage management work and what we're doing here. However, tiered storage management schemes put an entire block into a tier. This is different than what we want to (eventually) do with caching, which is cache only part of a block. If we end up with a heavyweight scheme where the entire block has to be loaded into memory before any of it can be accessed, this may actually cause a performance regression, not an improvement. In general, tiered storage management happens over a longer period of time than cache management. We need to be responsive to changes that happen in just a few seconds. In contrast, moving things from (say) hard disk to SSD and back will happen over minutes or hours. The same code is not going to be able to handle both well. The proposed implementations are quite different, as well. HSM will involve copying block files between local FS directories. Cache management will involve mlock'ing block files and passing the file descriptors to clients. You might well ask, why not simply copy the block file to /dev/shm for your implementation? However, this has the all or nothing problem described above (can't cache a partial block this way). It also has a more subtle problem with what we are calling revocation. Basically, a misbehaving client which holds an open file descriptor in /dev/shm can continue to use memory indefinitely-- there is no way the DataNode can ever revoke that memory. This problem does not exist with the mlock solution which we have outlined here. So while I think we should consider the possibility of sharing code as the two projects progress, I don't want to make this a subtask of that project. 
There are just too many differences in goals and approaches for it to make sense. Centralized cache management in HDFS Key: HDFS-4949 URL: https://issues.apache.org/jira/browse/HDFS-4949 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Affects Versions: 3.0.0, 2.2.0 Reporter: Andrew Wang Assignee: Andrew Wang Attachments: caching-design-doc-2013-07-02.pdf HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig, and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707467#comment-13707467 ]

Andrew Wang commented on HDFS-4949:

Sanjay, thanks for your comments! I need to look more at HDFS-2832, but I think we've got some nice overlap. In particular, I agree that cache would be just another DN Storage.

bq. Block reports will indicate the storage type.

I'm OK with this, but our initial design proposes separate heartbeats since cache reports might want to tick on a different interval. You might want quick cache reports if datanodes are doing their own LRU, but maybe you'd want to adaptively throttle them back if the NN is under load, since cache report processing can be expensive. Separate heartbeats per storage could definitely be added later, so consider this a later-stage optimization.

bq. NN will order the replicas locations based on closeness and speed;

This is tricky because it depends on the network topology and workload. I don't want my single cached replica to get hammered by the entire cluster, but perhaps going to in-rack memory is better than local disk. I figure clients should be able to provide a configurable policy to their DFSClients.

I think we also still need the isCached flag for scheduling. Hypothetically, MR might always want to place on a memory replica over a disk replica. So, we could sort memory replicas first, then disk replicas. However, this squishes the existing ordering based on network topology used by DFSClients, and all our DFSClients end up hammering the cached replica at read time.

Note that even without a smarter DFSClient, we can get a lot of benefit just by making schedulers place tasks for memory locality, since our big win is going to be local memory reads. Colin's working on this in HDFS-4952.

bq. NN will not count Ram replicas towards the normal replica count - this is one area where the ram replicas are treated differently.
bq. This can support a usage model where the ram replicas are at each or only some of the disk replica locations.

+1, let's design for a future where cache might not be disk-backed. As Colin notes above, memory HSM is not easy, but the code should be flexible.
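The trade-off described above between sorting memory replicas first and keeping the topology-based ordering can be sketched with two sort keys. This is an illustration under stated assumptions, not the NameNode's or DFSClient's actual logic: the `Replica` type, the `distance` values (0 for local, 2 for same rack, 4 for off rack) and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    host: str
    distance: int    # hypothetical topology distance from the reader
    is_cached: bool  # stands in for the isCached flag mentioned in the comment


def memory_first(replicas: List[Replica]) -> List[Replica]:
    """Cached replicas first; topology distance only breaks ties."""
    return sorted(replicas, key=lambda r: (not r.is_cached, r.distance))


def topology_first(replicas: List[Replica]) -> List[Replica]:
    """Closest replicas first; the cached flag only breaks ties."""
    return sorted(replicas, key=lambda r: (r.distance, not r.is_cached))


replicas = [
    Replica("dn1", 0, False),  # local disk
    Replica("dn2", 2, False),  # same-rack disk
    Replica("dn3", 4, True),   # off-rack memory
]
```

With this input, `memory_first` puts `dn3` at the head of the list for every reader, which is the "hammering" case, while `topology_first` keeps the local disk replica first. A configurable client policy, as suggested in the comment, would amount to choosing between keys like these.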
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707536#comment-13707536 ]

Suresh Srinivas commented on HDFS-4949:

bq. However, tiered storage management schemes put an entire block into a tier. This is different than what we want to (eventually) do with caching, which is cache only part of a block. If we end up with a heavyweight scheme where the entire block has to be loaded into memory before any of it can be accessed, this may actually cause a performance regression, not an improvement.

This is one point I have been mulling over as well. I agree that partially cached data does not fit with a storage hierarchy. But I was not sure whether partial caching is being considered in the first phase of implementation in this jira (sorry if I missed details; I have yet to read this carefully). One other thing that we are considering is adding a usage-based extra replica to the memory tier. I need to make some time to get all of that into a doc.
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707546#comment-13707546 ]

Todd Lipcon commented on HDFS-4949:

Hey folks. I agree that HSM is a much bigger task than what we're talking about here, and I'm not certain they can fit into the same framework. During our early internal design discussions I'd suggested the same thing, but after an hour or two of throwing the idea around, we discounted it for the reasons Colin mentioned above (partial caching and revocation).

Though partial caching isn't referenced in the doc, it's a straightforward extension that we plan to tackle down the road. For example, we can take each block, subdivide it into 1MB chunks, and then report a bitmap indicating which chunks are cached. Taking advantage of the kernel lets us do this relatively easily by calling mlock/munlock, and the revocation problem is again simple because a misbehaving client won't be able to pin memory.

I don't think this work precludes later work on the idea of memory-only storages/replicas. That has other advantages, particularly on the *write* side for temporary data, etc. But it is somewhat tricky to get right. When we do that, we should certainly look at it in a generalized way (RAM, SSD, Disk as a hierarchy).
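The chunk bitmap idea mentioned above (1MB chunks, one bit per chunk) can be sketched as follows. This is a minimal illustration, assuming a hypothetical `ChunkBitmap` class and bit layout; it only does the bookkeeping and does not call mlock/munlock, and it is not taken from the design doc or from HDFS.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as suggested in the comment


class ChunkBitmap:
    """Tracks which 1MB chunks of a single block are cached (hypothetical sketch)."""

    def __init__(self, block_length: int):
        self.num_chunks = (block_length + CHUNK_SIZE - 1) // CHUNK_SIZE
        self.bits = bytearray((self.num_chunks + 7) // 8)

    def _chunks(self, offset: int, length: int) -> range:
        """Indices of the chunks overlapped by the byte range [offset, offset + length)."""
        if length <= 0:
            return range(0)
        first = offset // CHUNK_SIZE
        last = min((offset + length - 1) // CHUNK_SIZE, self.num_chunks - 1)
        return range(first, last + 1)

    def mark_cached(self, offset: int, length: int) -> None:
        for i in self._chunks(offset, length):
            self.bits[i // 8] |= 1 << (i % 8)

    def mark_uncached(self, offset: int, length: int) -> None:
        for i in self._chunks(offset, length):
            self.bits[i // 8] &= ~(1 << (i % 8)) & 0xFF

    def is_cached(self, offset: int, length: int) -> bool:
        """True only if every chunk touched by the byte range is cached."""
        chunks = self._chunks(offset, length)
        return len(chunks) > 0 and all(
            self.bits[i // 8] & (1 << (i % 8)) for i in chunks
        )

    def report(self) -> bytes:
        """The bitmap a DataNode might send in a cache report."""
        return bytes(self.bits)
```

For a 128MB block this report is 16 bytes, which gives a rough sense of the per-block overhead that chunk-level tracking would add to a cache report.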
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707572#comment-13707572 ]

Suresh Srinivas commented on HDFS-4949:

The problem with partial caching, though, is that while it is easy to address the data to be cached at the file or directory level, a more granular level means bloat in how much data the cache management layer (which I would hope to see in the namenode) must track. How about making use of access patterns and building the cache from those, instead of a managed cache?
[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707576#comment-13707576 ]

Todd Lipcon commented on HDFS-4949:

Yes, I think the explicit caching policies would operate mostly on a whole-file granularity. The DNs, though, can do local LRU tracking of sub-blocks. This is useful for cases like querying a Parquet file where a single file is made up of subranges corresponding to different columns. Some columns may be hot while others are cold, and if the DN can notice this and start offering zero-copy for the hot columns, it will be a significant performance win.
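Local LRU tracking of sub-blocks on a DataNode, as described above, could look roughly like the sketch below. It is an assumption-laden illustration: the `SubBlockLru` class, the `(block_id, chunk_index)` key and the fixed capacity in chunks are all hypothetical, and a real DataNode would pair the eviction with an munlock of the evicted range.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple

ChunkKey = Tuple[int, int]  # (block_id, chunk_index), both hypothetical identifiers


class SubBlockLru:
    """Keeps the most recently read sub-block chunks, up to a fixed capacity."""

    def __init__(self, capacity_chunks: int):
        self.capacity = capacity_chunks
        self.entries: "OrderedDict[ChunkKey, None]" = OrderedDict()

    def touch(self, block_id: int, chunk_index: int) -> Optional[ChunkKey]:
        """Record a read of one chunk; return the chunk evicted to make room, if any."""
        key = (block_id, chunk_index)
        if key in self.entries:
            # Already tracked: refresh its position as most recently used.
            self.entries.move_to_end(key)
            return None
        self.entries[key] = None
        if len(self.entries) > self.capacity:
            # Drop the least recently used chunk.
            evicted, _ = self.entries.popitem(last=False)
            return evicted
        return None

    def hot_chunks(self) -> List[ChunkKey]:
        """Tracked chunks, from least to most recently used."""
        return list(self.entries)
```

In the Parquet case from the comment, reads of a hot column keep touching the same chunk indices, so those chunks stay in the tracked set while chunks belonging to cold columns age out.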