[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

Andrew Wang (JIRA) Fri, 09 Aug 2013 20:59:29 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735723#comment-13735723
 ]


Andrew Wang commented on HDFS-4949:
-----------------------------------

Hey Arun, thanks for taking a look!

Tying in YARN would definitely be great. There's half a hope that we can jump 
right from a prototype naive scheme to using YARN directly, but our resource 
management team doesn't have time in the near term to make this happen. I 
definitely want our abstractions to be as similar as possible to YARN's, 
though, to ease a future transition; your input there is appreciated.

As to your other points:

1. The main reason we added auto-caching of new files was actually for Hive. My 
understanding is that Hive users can drop new files into a Hive partition 
directory without notifying the Hive metastore, e.g. via the fs shell. Since 
we'd like to support caching higher-level constructs like Hive partitions or 
tables, this auto-caching is necessary.
2. We were planning on extending the existing getFileBlockLocations API (which 
takes a Path, offset, and length) to also indicate which replicas of the 
returned blocks are cached. This should satisfy the needs of framework 
schedulers like MR or Impala. At read time, we'll also provide per-stream 
statistics of the number of bytes read remotely vs. local disk vs. local 
memory. Remote memory reads are also on our mind, but will likely be a 
per-stream or per-client config option added later.
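
To make point 2 a bit more concrete, here is a rough sketch of how a 
framework scheduler might use cached-replica information, and how per-stream 
read statistics could be bucketed. This is not the actual HDFS API: 
CachedBlockLocation, its cachedHosts field, preferredHost, and ReadStats are 
all hypothetical names made up for illustration, and the real extension to 
getFileBlockLocations may look quite different.

```java
import java.util.Arrays;
import java.util.List;

public class CacheAwarePlacement {

    /** Hypothetical stand-in for a block location that also reports cached replicas. */
    static final class CachedBlockLocation {
        final long offset;               // offset of the block within the file
        final long length;               // length of the block in bytes
        final List<String> hosts;        // datanodes holding a replica on disk
        final List<String> cachedHosts;  // assumed: subset of hosts with the replica in memory

        CachedBlockLocation(long offset, long length,
                            List<String> hosts, List<String> cachedHosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
            this.cachedHosts = cachedHosts;
        }
    }

    /** Prefer a host with a cached replica, then any host with a disk replica, else null. */
    static String preferredHost(CachedBlockLocation loc) {
        if (!loc.cachedHosts.isEmpty()) {
            return loc.cachedHosts.get(0);
        }
        if (!loc.hosts.isEmpty()) {
            return loc.hosts.get(0);
        }
        return null;
    }

    /** Hypothetical per-stream counters: remote vs. local disk vs. local memory. */
    static final class ReadStats {
        long remoteBytes;
        long localDiskBytes;
        long localMemoryBytes;

        /** Classify a read of the given size by where the reader runs relative to the replicas. */
        void record(CachedBlockLocation loc, String readerHost, long bytes) {
            if (loc.cachedHosts.contains(readerHost)) {
                localMemoryBytes += bytes;
            } else if (loc.hosts.contains(readerHost)) {
                localDiskBytes += bytes;
            } else {
                remoteBytes += bytes;
            }
        }
    }

    public static void main(String[] args) {
        // Made-up datanode names; dn2 is the only one with the block cached.
        CachedBlockLocation loc = new CachedBlockLocation(
                0L, 128L, Arrays.asList("dn1", "dn2", "dn3"), Arrays.asList("dn2"));

        System.out.println("schedule task on: " + preferredHost(loc));

        ReadStats stats = new ReadStats();
        stats.record(loc, "dn2", 64L);  // reader on the caching datanode
        stats.record(loc, "dn1", 32L);  // reader on a disk-only replica
        stats.record(loc, "dn9", 32L);  // reader with no local replica
        System.out.println("memory=" + stats.localMemoryBytes
                + " disk=" + stats.localDiskBytes
                + " remote=" + stats.remoteBytes);
    }
}
```

The sketch only covers local reads; remote memory reads would need an extra 
bucket and, as noted above, would probably sit behind a config option.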

Suresh, to partially address your questions, Colin's going to put pools into 
the patch at HDFS-5052, and he's also been working on buffer-oriented access at 
HDFS-4953. Thanks for your comments on the subtasks thus far.
                
> Centralized cache management in HDFS
> ------------------------------------
>
>                 Key: HDFS-4949
>                 URL: https://issues.apache.org/jira/browse/HDFS-4949
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: 3.0.0, 2.3.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: caching-design-doc-2013-07-02.pdf, 
> caching-design-doc-2013-08-09.pdf
>
>
> HDFS currently has no support for managing or exposing in-memory caches at 
> datanodes. This makes it harder for higher level application frameworks like 
> Hive, Pig, and Impala to effectively use cluster memory, because they cannot 
> explicitly cache important datasets or place their tasks for memory locality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
