[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Fix Version/s: 2.3.0

> Centralized cache management in HDFS
> ------------------------------------
>
>                 Key: HDFS-4949
>                 URL: https://issues.apache.org/jira/browse/HDFS-4949
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: 3.0.0, 2.3.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>             Fix For: 2.3.0
>
>         Attachments: HDFS-4949-consolidated.patch,
> caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf,
> caching-design-doc-2013-10-24.pdf, caching-testplan.pdf,
> hdfs-4949-branch-2.patch
>
>
> HDFS currently has no support for managing or exposing in-memory caches at
> datanodes. This makes it harder for higher level application frameworks like
> Hive, Pig, and Impala to effectively use cluster memory, because they cannot
> explicitly cache important datasets or place their tasks for memory locality.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

linhaiqiang updated HDFS-4949:
-----------------------------
    Fix Version/s:     (was: 2.3.0)

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: HDFS-4949-consolidated.patch,
> caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf,
> caching-design-doc-2013-10-24.pdf, caching-testplan.pdf,
> hdfs-4949-branch-2.patch
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0
           Status: Resolved  (was: Patch Available)

I went through and resolved or pushed out all remaining subtasks. With the code in branch-2, we can resolve this parent issue.

Thanks for all the contributions from everyone involved!

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.4.0
>             Fix For: 2.4.0
>         Attachments: HDFS-4949-consolidated.patch,
> caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf,
> caching-design-doc-2013-10-24.pdf, caching-testplan.pdf,
> hdfs-4949-branch-2.patch
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Attachment: hdfs-4949-branch-2.patch

Attached is a consolidated patch for branch-2. Unfortunately we left the HDFS-4949 branch fallow while development continued in trunk, but I did my best to squash all of the caching-related patches committed thus far into this mega patch.

A preliminary test run of HDFS and Common looked good, but I'm running another right now on this version of the patch to verify.

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.4.0
>         Attachments: HDFS-4949-consolidated.patch,
> caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf,
> caching-design-doc-2013-10-24.pdf, caching-testplan.pdf,
> hdfs-4949-branch-2.patch
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Status: Patch Available  (was: Open)

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: caching-design-doc-2013-07-02.pdf,
> caching-design-doc-2013-08-09.pdf, caching-testplan.pdf,
> HDFS-4949-consolidated.patch
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Attachment: HDFS-4949-consolidated.patch

Consolidated patch attached to get a Jenkins run.

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: caching-design-doc-2013-07-02.pdf,
> caching-design-doc-2013-08-09.pdf, caching-testplan.pdf,
> HDFS-4949-consolidated.patch
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-4949:
---------------------------------------
    Attachment: caching-design-doc-2013-10-24.pdf

updated design doc.

Revisions:
* change future tense to present tense in some cases.
* grammar corrections
* update to reflect the fact that caching information is stored in {{LocatedBlocks}} rather than {{BlockLocation}}
* move cache expiry feature to future work
* remove part about pools being in a configuration file (they are stored in the edit log)
* rework API documentation to match current API

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: caching-design-doc-2013-07-02.pdf,
> caching-design-doc-2013-08-09.pdf, caching-design-doc-2013-10-24.pdf,
> caching-testplan.pdf, HDFS-4949-consolidated.patch
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Attachment:     (was: HDFS-4949-consolidated.patch)

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: caching-design-doc-2013-07-02.pdf,
> caching-design-doc-2013-08-09.pdf, caching-design-doc-2013-10-24.pdf,
> caching-testplan.pdf
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Attachment: HDFS-4949-consolidated.patch

New patch attached. The RAT error is spurious due to the CHANGES-HDFS-4949.txt file. I also want to fix the edits unit test after we merge, since historically checking in the new binary file has been tricky to get right via patch. We've gotten clean unit test runs on upstream Jenkins, so I have confidence that it's correct. Finally, the javac warnings are also intentional, related to using the internal unmap APIs.

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: caching-design-doc-2013-07-02.pdf,
> caching-design-doc-2013-08-09.pdf, caching-design-doc-2013-10-24.pdf,
> caching-testplan.pdf, HDFS-4949-consolidated.patch
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Chu updated HDFS-4949:
------------------------------
    Attachment: caching-testplan.pdf

Attaching the test plan for this feature (caching-testplan.pdf).

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: caching-design-doc-2013-07-02.pdf,
> caching-design-doc-2013-08-09.pdf, caching-testplan.pdf
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Attachment: caching-design-doc-2013-08-09.pdf

Suresh, thanks for posting your notes. Attached is a revised design doc that beefs up the resource management / user quotas section, as well as addressing your other smaller points.

As a meta-point, I think much of the remaining resource management design can wait until after we get the initial end-to-end implementation going. I think it's reasonable for the first iteration to do something simple like superuser only or user quotas, then we layer on the complexities of pools, priorities, ACLs, min/max/share, and failure cases afterwards. It's good to get the API roughly right so we code with foresight, but I don't see us getting around to implementing pools for at least a month or two.

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.3.0
>         Attachments: caching-design-doc-2013-07-02.pdf,
> caching-design-doc-2013-08-09.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-4949) Centralized cache management in HDFS
[ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-4949:
------------------------------
    Attachment: caching-design-doc-2013-07-02.pdf

Here's a design doc that we've been working on internally. It proposes adding off-heap caches to each datanode using mmap and mlock, managed centrally by the NameNode. Any feedback welcomed.

I'm hoping we can have a fruitful design discussion on this JIRA, then perhaps get a branch and start development.

> Centralized cache management in HDFS
> ------------------------------------
>
>    Affects Versions: 3.0.0, 2.2.0
>         Attachments: caching-design-doc-2013-07-02.pdf
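
For readers unfamiliar with the mmap half of the proposal above, here is a minimal sketch of mapping a file into off-heap memory using only the standard Java NIO API. It is not the HDFS implementation: the class name `MmapCacheSketch` and the helper `mapReadOnly` are hypothetical, and the temporary file merely stands in for a block file. Standard Java has no mlock call, so pinning the pages (which the design doc relies on) would need native code and is not shown; `load()` is only a best-effort request to page the data in.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapCacheSketch {

    /**
     * Hypothetical helper: maps the whole file read-only into off-heap memory.
     * The mapping stays valid after the channel is closed.
     */
    public static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Best-effort hint to bring the pages into physical memory.
            // This does NOT pin them the way mlock would.
            buf.load();
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a datanode block file (assumption).
        Path tmp = Files.createTempFile("block", ".dat");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "cached block".getBytes(StandardCharsets.UTF_8));

        MappedByteBuffer buf = mapReadOnly(tmp);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

Reads served from such a mapping avoid copying the data onto the Java heap, which is presumably part of why the proposal describes the caches as off-heap.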