[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881830#comment-13881830
 ] 

Hudson commented on HDFS-4949:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #461 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/461/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. 
(wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 Centralized cache management in HDFS
 

 Key: HDFS-4949
 URL: https://issues.apache.org/jira/browse/HDFS-4949
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Affects Versions: 3.0.0, 2.4.0
Reporter: Andrew Wang
Assignee: Andrew Wang
 Fix For: 2.4.0

 Attachments: HDFS-4949-consolidated.patch, 
 caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf, 
 caching-design-doc-2013-10-24.pdf, caching-testplan.pdf, 
 hdfs-4949-branch-2.patch


 HDFS currently has no support for managing or exposing in-memory caches at 
 datanodes. This makes it harder for higher level application frameworks like 
 Hive, Pig, and Impala to effectively use cluster memory, because they cannot 
 explicitly cache important datasets or place their tasks for memory locality.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881857#comment-13881857
 ] 

Hudson commented on HDFS-4949:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1678 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1678/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. 
(wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881875#comment-13881875
 ] 

Hudson commented on HDFS-4949:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1653 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1653/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. 
(wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-22 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879143#comment-13879143
 ] 

Colin Patrick McCabe commented on HDFS-4949:


The branch-2 patch looks good to me; thanks, Andrew.

 Centralized cache management in HDFS
 

 Key: HDFS-4949
 URL: https://issues.apache.org/jira/browse/HDFS-4949
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Affects Versions: 3.0.0, 2.4.0
Reporter: Andrew Wang
Assignee: Andrew Wang
 Attachments: HDFS-4949-consolidated.patch, 
 caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf, 
 caching-design-doc-2013-10-24.pdf, caching-testplan.pdf, 
 hdfs-4949-branch-2.patch




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-22 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879198#comment-13879198
 ] 

Andrew Wang commented on HDFS-4949:
---

Thanks Colin. The branch-2 test run I did with this also came back clean, so I 
think it's good to go. Will commit shortly.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-22 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879235#comment-13879235
 ] 

Andrew Wang commented on HDFS-4949:
---

I've committed this to branch-2 and fixed up the CHANGES.txt in branch-2 and 
trunk accordingly. It might finally be time to resolve this parent issue, and 
punt all remaining subtasks out into their own standalone issues.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879301#comment-13879301
 ] 

Hudson commented on HDFS-4949:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5035 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5035/])
Move HDFS-4949 subtasks in CHANGES.txt to a new section under 2.4.0 release. 
(wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1560528)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2014-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877995#comment-13877995
 ] 

Hadoop QA commented on HDFS-4949:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12624211/hdfs-4949-branch-2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5930//console

This message is automatically generated.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-12-23 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855882#comment-13855882
 ] 

Arun C Murthy commented on HDFS-4949:
-

In YARN-1488 we've been discussing a proposal to leverage YARN's 
resource-management *and* workload-management capabilities (via delegation of 
resources, in this case RAM, to HDFS) to provide more general cache 
administration.

 Centralized cache management in HDFS
 

 Key: HDFS-4949
 URL: https://issues.apache.org/jira/browse/HDFS-4949
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Affects Versions: 3.0.0, 2.4.0
Reporter: Andrew Wang
Assignee: Andrew Wang
 Attachments: HDFS-4949-consolidated.patch, 
 caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf, 
 caching-design-doc-2013-10-24.pdf, caching-testplan.pdf




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-12-03 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838467#comment-13838467
 ] 

Fengdong Yu commented on HDFS-4949:
---

[~cnauroth], yes, I can find CacheManager in trunk, but it's not consistent 
with the HDFS-4949 branch.

Are changes still being committed to the HDFS-4949 branch even after the merge 
to trunk?



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-12-03 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838471#comment-13838471
 ] 

Andrew Wang commented on HDFS-4949:
---

Hey [~azuryy], since the merge, we've been committing caching-related patches 
just to trunk. The branch is now defunct. We're still using this JIRA for 
tracking ongoing subtasks though, and some JIRAs also have the caching label.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-12-03 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838481#comment-13838481
 ] 

Fengdong Yu commented on HDFS-4949:
---

Thanks, Andrew.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-12-02 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836687#comment-13836687
 ] 

Chris Nauroth commented on HDFS-4949:
-

Hi, [~azuryy].  The merge vote passed and HDFS-4949 was merged to trunk about a 
month ago.  For example, here you can see the {{CacheManager}} class on trunk:

http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/CacheManager.java


 Centralized cache management in HDFS
 

 Key: HDFS-4949
 URL: https://issues.apache.org/jira/browse/HDFS-4949
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Affects Versions: 3.0.0, 2.3.0
Reporter: Andrew Wang
Assignee: Andrew Wang
 Attachments: HDFS-4949-consolidated.patch, 
 caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf, 
 caching-design-doc-2013-10-24.pdf, caching-testplan.pdf




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-12-01 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836321#comment-13836321
 ] 

Fengdong Yu commented on HDFS-4949:
---

The vote thread was started a month ago, but the HDFS-4949 branch still has not 
been merged into trunk. Are there any blockers?





[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807867#comment-13807867
 ] 

Hudson commented on HDFS-4949:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #377 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/377/])
Merge HDFS-4949 branch back into trunk (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1536572)
* /hadoop/common/trunk
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BatchedRemoteIterator.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ByteBufferUtil.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/HasEnhancedByteBufferAccess.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ReadOption.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ZeroCopyUnavailableException.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/permission/FsPermission.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ByteBufferPool.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ElasticByteBufferPool.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IdentityHashStore.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IntrusiveCollection.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightCache.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightGSet.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringUtils.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/nativeio/TestNativeIO.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestIdentityHashStore.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestLightWeightGSet.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmap.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmapManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/AddPathBasedCacheDirectiveException.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/CachePoolInfo.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/PathBasedCacheDescriptor.java
* 

[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807887#comment-13807887
 ] 

Hudson commented on HDFS-4949:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1567 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1567/])
Merge HDFS-4949 branch back into trunk (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1536572)

[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807952#comment-13807952
 ] 

Hudson commented on HDFS-4949:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1593 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1593/])
Merge HDFS-4949 branch back into trunk (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1536572)
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java
* 

[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-28 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807135#comment-13807135
 ] 

Chris Nauroth commented on HDFS-4949:
-

+1 for the merge.  Thanks again, Andrew and Colin.

 Centralized cache management in HDFS
 

 Key: HDFS-4949
 URL: https://issues.apache.org/jira/browse/HDFS-4949
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Affects Versions: 3.0.0, 2.3.0
Reporter: Andrew Wang
Assignee: Andrew Wang
 Attachments: caching-design-doc-2013-07-02.pdf, 
 caching-design-doc-2013-08-09.pdf, caching-design-doc-2013-10-24.pdf, 
 caching-testplan.pdf, HDFS-4949-consolidated.patch


 HDFS currently has no support for managing or exposing in-memory caches at 
 datanodes. This makes it harder for higher level application frameworks like 
 Hive, Pig, and Impala to effectively use cluster memory, because they cannot 
 explicitly cache important datasets or place their tasks for memory locality.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807547#comment-13807547
 ] 

Hudson commented on HDFS-4949:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4664 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4664/])
Merge HDFS-4949 branch back into trunk (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1536572)
* /hadoop/common/trunk
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BatchedRemoteIterator.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ByteBufferUtil.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/HasEnhancedByteBufferAccess.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ReadOption.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ZeroCopyUnavailableException.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/permission/FsPermission.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ByteBufferPool.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/ElasticByteBufferPool.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IdentityHashStore.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/IntrusiveCollection.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightCache.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LightWeightGSet.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringUtils.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/nativeio/TestNativeIO.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestIdentityHashStore.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestLightWeightGSet.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmap.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/ClientMmapManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/AddPathBasedCacheDirectiveException.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/CachePoolInfo.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/DatanodeInfo.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/PathBasedCacheDescriptor.java
* 

[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805435#comment-13805435
 ] 

Suresh Srinivas commented on HDFS-4949:
---

Given that the merge vote thread has been started, can someone post details 
about what functionality from the original design has been completed and what 
is pending? It looks like the current functionality does not cover quota 
management. Are any other features pending?



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-25 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805463#comment-13805463
 ] 

Chris Nauroth commented on HDFS-4949:
-

Here is a list of items discussed in the design doc to be completed later, with 
the corresponding JIRA if it exists:
* quota enforcement - need to file JIRA?
* cache expiry based on TTL - need to file JIRA?
* incremental cache reports - HDFS-5092
* metrics - HDFS-5320




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-25 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805791#comment-13805791
 ] 

Colin Patrick McCabe commented on HDFS-4949:


Quick note: TestOfflineEditsViewer failed in Jenkins because the consolidated 
patch didn't change the binary edit log file used by that test. It succeeds on 
the branch.

bq. incremental cache reports - HDFS-5092

This is listed as a maybe in the design doc; it's something that we want to 
evaluate before doing.

Quotas, metrics, and TTL are definitely post-merge, I think.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804816#comment-13804816
 ] 

Hadoop QA commented on HDFS-4949:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12610186/caching-design-doc-2013-10-24.pdf
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5272//console

This message is automatically generated.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-24 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804828#comment-13804828
 ] 

Chris Nauroth commented on HDFS-4949:
-

Thank you, Colin and Stephen.  The design doc and test plan LGTM.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804840#comment-13804840
 ] 

Hadoop QA commented on HDFS-4949:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12610166/HDFS-4949-consolidated.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 29 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1551 javac 
compiler warnings (more than the trunk's current 1548 warnings).

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 5 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 8 new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.web.TestJsonUtil
  
org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5270//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5270//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5270//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5270//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5270//console

This message is automatically generated.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-10-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805021#comment-13805021
 ] 

Hadoop QA commented on HDFS-4949:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12610221/HDFS-4949-consolidated.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 30 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1551 javac 
compiler warnings (more than the trunk's current 1548 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5273//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5273//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5273//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5273//console

This message is automatically generated.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740145#comment-13740145
 ] 

Andrew Wang commented on HDFS-4949:
---

Hi Arun,

On the read path comments, it might be illuminating to check out the zero-copy 
read API that Colin's working on at HDFS-4953. The idea is that clients always 
use the zero-copy cursor to do reads, which behind the scenes will do an mmap'd 
read if the block is cached, or a normal copying read if the block is on disk 
or remote. It allows an {{isCached}}-type check by not setting a fallback 
buffer for copying reads; the cursor then throws an exception on read if the 
block is not cached. Finally, there's also a parameter for enabling short 
reads, which comes into play when a read spans block files.
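
The fallback-buffer behavior described above can be modeled with a small 
stand-alone sketch. Everything in it is hypothetical: SketchBlock, 
ZeroCopyCursorSketch and setFallbackBuffer are illustrative names rather than 
the HDFS-4953 API, a heap ByteBuffer view stands in for the mmap'd region, and 
the choice of UnsupportedOperationException is an assumption.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a block replica; not an HDFS class. */
final class SketchBlock {
  final byte[] data;
  final boolean cached; // true if the block is assumed to be held in memory

  SketchBlock(byte[] data, boolean cached) {
    this.data = data;
    this.cached = cached;
  }
}

/** Hypothetical cursor; the real zero-copy API may look quite different. */
final class ZeroCopyCursorSketch {
  private final SketchBlock block;
  private ByteBuffer fallback; // null means "cached reads only"

  ZeroCopyCursorSketch(SketchBlock block) {
    this.block = block;
  }

  void setFallbackBuffer(ByteBuffer fallback) {
    this.fallback = fallback;
  }

  /** Returns a read-only view of up to len bytes starting at offset. */
  ByteBuffer read(int offset, int len) {
    if (offset < 0 || offset > block.data.length || len < 0) {
      throw new IllegalArgumentException("range is outside the block");
    }
    int n = Math.min(len, block.data.length - offset);
    if (block.cached) {
      // Stands in for the mmap'd path: a view is returned, no bytes are copied.
      return ByteBuffer.wrap(block.data, offset, n).slice().asReadOnlyBuffer();
    }
    if (fallback == null) {
      // No fallback buffer set: this acts as the isCached-type check.
      throw new UnsupportedOperationException(
          "block is not cached and no fallback buffer is set");
    }
    // Normal copying read into the caller-supplied buffer; the result may be
    // shorter than requested if the buffer is small.
    fallback.clear();
    n = Math.min(n, fallback.capacity());
    fallback.put(block.data, offset, n);
    fallback.flip();
    return fallback.asReadOnlyBuffer();
  }
}
```

In this model, a caller that only wants memory-speed reads leaves the fallback 
buffer unset and treats the exception as "not cached"; a caller that just wants 
the bytes sets a fallback buffer once and never sees the difference.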

On YARN integration, I'd like to revisit that a little ways down the road since 
we're focusing on getting a basic prototype out. If you want to get started on 
it now, it'd be helpful if you could review the current RM plan in the doc, and 
sketch out how a YARN-based architecture would look.

 Centralized cache management in HDFS
 

 Key: HDFS-4949
 URL: https://issues.apache.org/jira/browse/HDFS-4949
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Affects Versions: 3.0.0, 2.3.0
Reporter: Andrew Wang
Assignee: Andrew Wang
 Attachments: caching-design-doc-2013-07-02.pdf, 
 caching-design-doc-2013-08-09.pdf


 HDFS currently has no support for managing or exposing in-memory caches at 
 datanodes. This makes it harder for higher level application frameworks like 
 Hive, Pig, and Impala to effectively use cluster memory, because they cannot 
 explicitly cache important datasets or place their tasks for memory locality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736019#comment-13736019
 ] 

Arun C Murthy commented on HDFS-4949:
-

bq. 1. The main reason we added auto-caching of new files was actually for 
Hive. My understanding is that Hive users can drop new files into a Hive 
partition directory without notifying the Hive metastore, e.g. via the fs 
shell. 

Usually partitions in Hive are new directories. So every 5 or 10 or 15 mins a 
new directory is added along with new data. Hence, the ability to automatically 
cache new files seems redundant.

bq. 2. We were planning on extending the existing getFileBlockLocations API 
(which takes a Path, offset, and length) to also indicate which replicas of the 
returned blocks are cached. This should satisfy the needs of framework 
schedulers like MR or Impala. 
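
A scheduler could use such per-replica cache information roughly as follows. 
This is only a sketch under stated assumptions: CachedBlockLocationSketch and 
its fields are hypothetical names, not the actual return type of 
getFileBlockLocations, and "cached hosts first" is just one possible placement 
policy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical block location record; field names are illustrative only. */
final class CachedBlockLocationSketch {
  final long offset;
  final long length;
  final List<String> hosts;       // every datanode holding a replica
  final List<String> cachedHosts; // assumed subset holding it in memory

  CachedBlockLocationSketch(long offset, long length,
                            List<String> hosts, List<String> cachedHosts) {
    this.offset = offset;
    this.length = length;
    this.hosts = hosts;
    this.cachedHosts = cachedHosts;
  }

  /** Hosts ordered for task placement: cached replicas first, then the rest. */
  List<String> preferredHosts() {
    List<String> ordered = new ArrayList<>(cachedHosts);
    for (String host : hosts) {
      if (!ordered.contains(host)) {
        ordered.add(host);
      }
    }
    return ordered;
  }
}
```

A framework scheduler would then try the hosts in the returned order, so a task 
lands on a node with a cached replica when one is available and falls back to 
ordinary disk locality otherwise.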

[~andrew.wang] Agree that the enhancement to getFileBlockLocations suffices for 
the scheduler. However, at read time it will be very useful to get an indicator 
on whether it's cached or not during open. The RecordReader needs this API to 
decide whether to do stream-based reads (when data isn't cached in RAM) or mmap 
the file (when it's cached). It would be unfortunate to have to do another call 
to getFileBlockLocations to validate during read time.

For example, SequenceFileRecordReader.initialize would look something like:

{code:title=SequenceFileRecordReader.java}
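  public void initialize(InputSplit split,
                         TaskAttemptContext context
                         ) throws IOException, InterruptedException {

    // ...
    FileSplit fileSplit = (FileSplit) split;

    // Proposed API (does not exist today): open() returns a StreamOrCached
    // handle that reports whether the requested range is cached in RAM.
    StreamOrCached splitData = fileSplit.getPath().open(fileSplit.getStart(),
                                                        fileSplit.getLength());
    InputStream in = null;
    if (splitData.isCached()) {
      // Cached: read straight from the mmap'd buffer.
      in = new ByteBufferInputStream(splitData.getByteBuffer());
    } else {
      // Not cached: fall back to a regular stream-based read.
      in = splitData.getFSDataInputStream();
    }

    // Now use in
    // ...
  }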
{code}

So, having an open API that returns something like StreamOrCached would be 
useful, as sketched above.

Open to other ideas, but hopefully I put across what I'm looking for.

Thoughts?



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736020#comment-13736020
 ] 

Arun C Murthy commented on HDFS-4949:
-

bq. Tying in YARN would definitely be great. There's half a hope that we can 
jump right from a prototype naive scheme to using YARN directly ...

I'm happy to help to get that done, let's discuss more.

Agree that having the right abstractions is important.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-09 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734926#comment-13734926
 ] 

Andrew Wang commented on HDFS-4949:
---

Hi Tsuyoshi,

HDFS-4953 allows applications to do zero-copy reads, so when combined with this 
JIRA, HDFS will be able to provide full memory-bandwidth reads on cached data. 
Deserialization is a somewhat separate concern since it happens at the 
application-level though. If an app can operate directly on the raw bytes in a 
file (e.g. a ByteBuffer), then it can avoid deserialization overhead. IIUC, 
this is untrue of the current MR input formats.

 Centralized cache management in HDFS
 

 Key: HDFS-4949
 URL: https://issues.apache.org/jira/browse/HDFS-4949
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Affects Versions: 3.0.0, 2.3.0
Reporter: Andrew Wang
Assignee: Andrew Wang
 Attachments: caching-design-doc-2013-07-02.pdf


 HDFS currently has no support for managing or exposing in-memory caches at 
 datanodes. This makes it harder for higher level application frameworks like 
 Hive, Pig, and Impala to effectively use cluster memory, because they cannot 
 explicitly cache important datasets or place their tasks for memory locality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-09 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735382#comment-13735382
 ] 

Suresh Srinivas commented on HDFS-4949:
---

bq. As a meta-point, I think much of the remaining resource management design 
can wait until after we get the initial end-to-end implementation going.
+1 for this. There are many loose ends to be tied and details to be figured out 
in the design. But the basic implementation could start right away.

Some things that we should get to sooner rather than later:
- Pool abstraction and making sure all the APIs are using it (including cache 
creation and deletion)
- Some details related to how the stream-oriented APIs change to 
buffer-oriented access.

The real quota management, counting common cached data against different 
pools, etc. can be revisited later. I will take a look at the updated doc soon. 
Thanks, Andrew.
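
To make the pool abstraction concrete, here is a minimal sketch in which cache 
creation and deletion both go through a pool. All names (CachePoolSketch, 
addDirective, removeDirective) and the per-pool byte limit are assumptions made 
for illustration; they are not taken from the design doc or the HDFS API, and 
shared data counted against several pools is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache pool; names and semantics are assumptions. */
final class CachePoolSketch {
  /** One cached path and the bytes it is assumed to occupy. */
  private static final class Directive {
    final String path;
    final long bytes;

    Directive(String path, long bytes) {
      this.path = path;
      this.bytes = bytes;
    }
  }

  private final String name;
  private final long limitBytes; // assumed per-pool limit
  private final Map<Long, Directive> directives = new LinkedHashMap<>();
  private long usedBytes = 0;
  private long nextId = 1;

  CachePoolSketch(String name, long limitBytes) {
    this.name = name;
    this.limitBytes = limitBytes;
  }

  /** Cache creation goes through the pool; returns the new directive id. */
  long addDirective(String path, long bytes) {
    if (usedBytes + bytes > limitBytes) {
      throw new IllegalStateException(
          "pool " + name + " would exceed its limit caching " + path);
    }
    long id = nextId++;
    directives.put(id, new Directive(path, bytes));
    usedBytes += bytes;
    return id;
  }

  /** Cache deletion also goes through the pool; false if the id is unknown. */
  boolean removeDirective(long id) {
    Directive removed = directives.remove(id);
    if (removed == null) {
      return false;
    }
    usedBytes -= removed.bytes;
    return true;
  }

  long usedBytes() {
    return usedBytes;
  }
}
```

Routing both operations through one object keeps the accounting in a single 
place, which is what would let real quota enforcement be added later without 
changing callers.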



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-09 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735632#comment-13735632
 ] 

Arun C Murthy commented on HDFS-4949:
-

bq. As a meta-point, I think much of the remaining resource management design 
can wait until after we get the initial end-to-end implementation going. 

Makes sense.

I, for one, would volunteer to help you guys do resource management directly 
via YARN rather than go the route of inventing half of the YARN RM within HDFS. It 
would benefit both HDFS (simpler, plus the ability to use memory dynamically 
between applications and for caching) and YARN (more robust for a diverse set of 
applications). Any takers? Thanks.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-09 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735643#comment-13735643
 ] 

Arun C Murthy commented on HDFS-4949:
-

[~andrew.wang] overall it looks great; some more questions:

# I'm not sure you want to automatically add new files in a directory to the 
cache; it seems a higher-level system (Hive, Impala, HCat) is in a better 
position to decide. Not doing this automatically simplifies cache mgmt, quota 
mgmt etc.
# Can you please provide details on the read APIs? For the Hive/MR/Pig use case 
I'd like to see a new open(Path, offset, length) which returns an indicator for 
whether the block is cached or not. This would be used, e.g., by the 
RecordReader to read the split.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-09 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735723#comment-13735723
 ] 

Andrew Wang commented on HDFS-4949:
---

Hey Arun, thanks for taking a look!

Tying in YARN would definitely be great. There's half a hope that we can jump 
right from a prototype naive scheme to using YARN directly, but our resource 
management team doesn't have time in the near term to make this happen. I 
definitely want our abstractions to be as similar as possible though to ease a 
future transition; your input there is appreciated.

As to your other points:

1. The main reason we added auto-caching of new files was actually for Hive. My 
understanding is that Hive users can drop new files into a Hive partition 
directory without notifying the Hive metastore, e.g. via the fs shell. Since 
we'd like to support caching higher-level abstractions like Hive partitions or 
tables, this auto-caching is necessary.
2. We were planning on extending the existing getFileBlockLocations API (which 
takes a Path, offset, and length) to also indicate which replicas of the 
returned blocks are cached. This should satisfy the needs of framework 
schedulers like MR or Impala. At read time, we'll also provide per-stream 
statistics of the number of bytes read remotely vs. local disk vs. local 
memory. Remote memory reads are also on our mind, but will likely be a 
per-stream or per-client config option added later.
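
To illustrate point 2, here is a minimal sketch of how a framework scheduler 
might consume block locations that flag cached replicas. The names used 
(CachedLocationSketch, BlockInfo, cachedHosts, preferredHosts) are hypothetical 
stand-ins and not the HDFS API; the exact shape of the extended 
getFileBlockLocations result was still open at this point in the discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not the HDFS API: one block of a file, the datanodes
// holding a replica of it, and the subset that also holds it in memory.
final class CachedLocationSketch {

    static final class BlockInfo {
        final long offset;
        final long length;
        final List<String> hosts;        // all datanodes with a replica
        final List<String> cachedHosts;  // datanodes with the block cached (assumed field)

        BlockInfo(long offset, long length,
                  List<String> hosts, List<String> cachedHosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
            this.cachedHosts = cachedHosts;
        }
    }

    // Returns the hosts in scheduling preference order: hosts with a cached
    // replica first, followed by the remaining disk-only hosts.
    static List<String> preferredHosts(BlockInfo block) {
        List<String> ordered = new ArrayList<>(block.cachedHosts);
        for (String host : block.hosts) {
            if (!ordered.contains(host)) {
                ordered.add(host);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        BlockInfo block = new BlockInfo(0L, 128L * 1024 * 1024,
                Arrays.asList("dn1", "dn2", "dn3"),
                Arrays.asList("dn3"));
        System.out.println(preferredHosts(block));
    }
}
```

A scheduler such as MR or Impala could then try to place the task on the first 
host in the returned list and fall back to the later ones.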

Suresh, to partially address your questions, Colin's going to put pools into 
the patch at HDFS-5052, and he's also been working on buffer-oriented access at 
HDFS-4953. Thanks for your comments on the subtasks thus far.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-06 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731077#comment-13731077
 ] 

Colin Patrick McCabe commented on HDFS-4949:


I created a branch for this (HDFS-4949)



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-06 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731143#comment-13731143
 ] 

Suresh Srinivas commented on HDFS-4949:
---

My notes from the meeting:
Enabling this feature on the Windows platform requires the following:
# Need a Unix domain sockets equivalent
# mmap and munmap are done in Java and should not require any Windows-specific 
changes
# mlock there is no windows equivalent?

Quota for datanode cache is counted against pool

Design needs to cover the following scenarios in more detail:
# Two pools caching the same file and how is quota counted
# Resource failures and how it affects existing caches for the pools. Perhaps 
pools should have priorities.
#* scenario 1 - resource failure takes down cached data. In the first cut, no 
new cached replicas will be created.
#* scenario 2 - resources failed and cluster capacity is low, then the 
application even if higher priority will not get cache quota.
# Caching supported for whole file for now.
# Only completed blocks will be cached. This is true for files that are being 
written.
# symlink paths will not be cached
# Need to add more details on enabling cache for a directory and how the newly 
created files (on completion of write) will be added to the cache. This also 
has quota implications and need for handling failures related to either 
reaching quota or non-availability of resources for such automatic caching to 
work.

We should add a TTL for caching requests and expire the cache accordingly.

I think we should refresh the design document based on the discussions from 
the meeting.
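
The open question of how quota is counted when two pools cache the same file 
can be made concrete with a small sketch. It compares two candidate policies 
raised in the discussion: charging each pool in full, or splitting the cost 
evenly. All names here (PoolQuotaSketch, Policy, charge, fits) are hypothetical, 
and neither policy is a decision recorded on this JIRA.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of two ways to charge pool quota for a shared cached file.
final class PoolQuotaSketch {

    enum Policy { CHARGE_EACH_POOL_FULLY, SPLIT_EVENLY }

    // Returns the bytes charged to each pool for one cached file of
    // fileBytes that every pool in the list has asked to cache.
    // Integer division drops any remainder in this sketch.
    static Map<String, Long> charge(long fileBytes, List<String> pools, Policy policy) {
        Map<String, Long> charged = new LinkedHashMap<>();
        for (String pool : pools) {
            long share = (policy == Policy.SPLIT_EVENLY)
                    ? fileBytes / pools.size()
                    : fileBytes;
            charged.put(pool, share);
        }
        return charged;
    }

    // True if charging delta more bytes keeps the pool within its quota.
    static boolean fits(long usedBytes, long quotaBytes, long delta) {
        return usedBytes + delta <= quotaBytes;
    }

    public static void main(String[] args) {
        List<String> pools = Arrays.asList("etl", "adhoc");
        System.out.println(charge(1000L, pools, Policy.CHARGE_EACH_POOL_FULLY));
        System.out.println(charge(1000L, pools, Policy.SPLIT_EVENLY));
    }
}
```

Under full charging, the sum charged across pools can exceed the memory 
actually used. Under even splitting, a pool's charge grows when the other pool 
withdraws its request, so a pool could go over quota without issuing any new 
request. Both effects would need to be handled by whichever policy is chosen.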




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-06 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731158#comment-13731158
 ] 

Chris Nauroth commented on HDFS-4949:
-

bq. mlock there is no windows equivalent?

I believe the Windows equivalents of {{mlock}} and {{munlock}} are 
{{VirtualLock}} and {{VirtualUnlock}}.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366910(v=vs.85).aspx




[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-08-06 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731632#comment-13731632
 ] 

Tsuyoshi OZAWA commented on HDFS-4949:
--

Hi,

Are there any plans to add APIs to access this caching layer directly from a 
processing framework, e.g. MR? The RDD (Spark) paper says that 
serialization/deserialization of file contents can be the bottleneck of 
processing. If we have such general APIs using mlock/munlock, it can reduce 
processing time drastically. Or is this idea out of scope for this JIRA?



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-31 Thread David S. Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13725704#comment-13725704
 ] 

David S. Wang commented on HDFS-4949:
-

HDFS-4949 meeting, July 29, 2013 2 PM @ Hortonworks office


Attendees


Aaron T. Myers
Andrew Wang
Arpit Gupta
Bikas Saha
Brandon Li
Colin McCabe
Dave Wang
Jing Zhao
Suresh Srinivas
Sanjay Radia
Todd Lipcon
Vinod Kumar Vavilapalli

Minutes


* General agreement to hold HDFS-2832 meeting some other day

* Andrew: Posted HDFS-4949 design doc upstream; Sanjay has read this, agrees 
with the goals

Data path (zero-copy reads)


* Sanjay: quota mgmt - counted up front, not after cache is populated

* Colin: talking about ZCR (mmap) - used to implement caching at the DN level
** Considered copying everything into /dev/shm (e.g. Tachyon).  But cannot 
cache parts of a file, so limits our flexibility.  Also, the associated fd 
gives clients a way to control memory mgmt (will not release until that 
descriptor is closed), which is not good because of buggy clients etc.
** Sanjay: you want an abstraction for a durable file.  Colin: yes.
** Colin: ZCR currently doesn't have checksums, but will.  Todd: assumption is 
that DN will do the cksum when doing the mlock and communicate that to the 
client so the client knows that it's safe to read.
** Todd: mincore() can tell you what's already in cache, but it's too granular, 
 very expensive to call, and can be out-of-date immediately.
** Assuming this is for local clients only obviously.
** ZCR uses ByteBuffers to avoid copies.  Not entirely compatible with current 
DFSClient since that uses byte arrays, so you cannot avoid copies.

* This may have a conflict with inline checksums.  Clients would have to be 
aware of how to skip over checksums, and this would have to be in the app, not 
the client since we're talking mmap.
** HBase gets around this by disabling HDFS-level cksums, and doing it on their 
level.
** Sanjay: QFS puts all of the cksums in the beginning of the file
** Todd: Liang Xie had an HBase study where he figured out that perf didn't 
improve until he got to a TB of data, when the cksum files themselves dropped 
out of cache.

* ZCR API can be made public?  Colin, Todd: Yes.
** Hard to compete with Spark if this isn't public.
** Suresh: Will the app know if you got ZCR?  Can be added as counters.  Colin: 
already have similar concepts today for SCR on a per-stream basis.
** Todd: SCR is fully transparent (uses today's API), while ZCR requires new 
client API.

* Sanjay: Current policy is manual.  Later policy can have system automatically 
cache hot files.  Need the fallback buffer in case you are remote.
** Todd: high perf apps will always use the ZCR API.  Sometimes it will fall 
back to a normal read, so no worse than today.
** Colin: should we have a flag that basically says always mmap?  Can add it 
later, don't know how useful this could be.

* Colin: no native support required for ZCR beyond what is there today.  There 
are some libhdfs changes, but not completely required.  Java has mmap today.
** We will need a native call for locking though.
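
Colin's remark that Java already has mmap can be illustrated with plain NIO. 
The sketch below maps a local file read-only and reads it through a ByteBuffer 
instead of copying it into a byte array. It uses only java.nio and is not the 
ZCR client API, which was not yet defined; the class and method names are 
hypothetical. As noted above, pinning the pages in memory (mlock) would still 
need a native call, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: read a local file through a memory mapping.
final class MmapReadSketch {

    // Maps the whole file read-only and sums its bytes (as unsigned values)
    // straight from the mapping, without an intermediate byte[].
    // A single mapping is limited to Integer.MAX_VALUE bytes.
    static long sumBytes(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF;
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap-sketch", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(sumBytes(tmp));
    }
}
```

The mapping stays valid after the channel is closed and is only released when 
the buffer is garbage collected, which is one reason the revocation concerns 
discussed in this thread matter.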

Centralized cache mgmt


* Andrew gave whiteboard presentation
** DN has mlock hooks, ulimit conf of how much it can cache
** NN sends heartbeats to DN with cache/uncache commands on whole blocks
** DN will send cache state to NN similar to block reports
** clients call getFileBlockLocations() with storageType arg.  This returns the 
current state of the cache.
** clients can issue CachingRequests, with a path that points to a file or 
directory.  If directory, then what is cached is what's in that directory (but 
not recursing into subdirs), in order to support Hive.  Can also specify user, 
pool for quota mgmt.  Can also specify # cache copies (must be <= 
replicationFactor).

* Quotas
** Quotas are on pools, not users.  Quotas enforced on the NN.
** Questions about what is cached as machines come and go?  Use 
getFileBlockLocations() to get cache request and current status of fulfillment. 
 Can be not fulfilled due to quota for instance.
*** Should cache requests from two pools be counted fully against both?  
Half-half?  Cluster capacity can be dynamic, so you always have potential quota 
mgmt problems.
** Don't want this to get so complicated that you are basically implementing 
another scheduler just for cache quotas.
** Suresh: Resource failures - how does this affect the pools?  Should we have 
priorities for pools?  Priorities for individual CacheRequests?
** Andrew: suggestion of min/max/share (similar to VMware ESX VM memory 
configuration).
** Suresh: fine with doing something very basic, and then be more intelligent 
later.
** Sanjay: need to have some idea of per-pool priority to enforce min, to 
figure out what to evict from the cache first in mem-constrained scenarios.  
Also what happens once we have resources again?

* Suresh: 

[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-29 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723306#comment-13723306
 ] 

Suresh Srinivas commented on HDFS-4949:
---

As discussed in the comments earlier, a few of us are going to meet to discuss 
the design and issues related to this jira. I have set up a meeting at the 
Hortonworks office. We should be able to host around 15-20 people. I already 
have [~andrew.wang], [~arpitgupta], [~atm], [~bikassaha], [~cmccabe], 
[~sanjay.radia], [~sureshms], [~vinodkv], [~jingzhao] and gopal as attendees. 
Others who want to attend the meeting or want to join over the phone, please 
reach out to me at sur...@hortonworks.com.

We will post notes from the discussion to this jira.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-18 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712529#comment-13712529
 ] 

Sanjay Radia commented on HDFS-4949:


Caching partial blocks: There is no problem with a DN caching only the hot 
parts of a block and still declaring to the NN that the block is cached in RAM. 
This would fit in with the proposal of abstracting RAM copies as replicas. The 
use case that does not fit is where one Datanode, DN1, has cached the first 100 
bytes and another Datanode, DN2, has cached the last 100 bytes, and you want the 
client to go to the right Datanode based on what portion of the file it is 
reading. If and when we finally get to caching portions and we want to support 
the use case mentioned, we could at that time consider having the block-info 
sent for RAM replicas indicate which portions are cached -- this would mean that 
certain replicas have additional information in the block map.

Given that we are not caching portions of blocks for this Jira and that for 
tiered storage for SSDs we want to add the device info to block location, I 
suggest that we proceed with abstracting RAM copies as replicas and revisit 
this decision for partial block caching at a later point.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-18 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712579#comment-13712579
 ] 

Colin Patrick McCabe commented on HDFS-4949:


As Todd, Andrew, and I said before, all of the designs we considered that 
treated what was in the cache as replicas suffered from an inability to revoke 
the client's access to this memory.  If you pass the client a file descriptor 
to a file in {{/dev/shm}}, you cannot revoke access to that later on.  The 
client can hold on to that memory forever.  That alone is enough to throw out 
that design.

To avoid this, we have to use mmap of a file on disk.  And when you do that, it 
can no longer be abstracted as a replica, because the on-disk copy has to 
exist.  It is, at best, a property of an existing replica.

Just as important, caching decisions also have to be made on a different 
timescale than decisions about hierarchical storage management.  HSM decisions 
can be made over the course of minutes or hours; caching decisions have to be 
made in seconds to be relevant.

Memory is not a storage tier.  It doesn't store anything; rather, it caches.  
Does it make sense to fsck memory?  That is silly.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-18 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712856#comment-13712856
 ] 

Sanjay Radia commented on HDFS-4949:


bq.  we have to use mmap of a file on disk.
Please look at my comments:  I have not objected to mmap and mlock.
I am fine with having RAM replicas backed by disk replicas; indeed I see this as 
an important advantage over Ramfs, where the data is copied. The replica 
abstraction allows for a more general view where they are not backed by disk, 
but our implementation restricts the memory replicas to be backed by disk 
replicas.


bq. In general, tiered storage management happens over a longer period of time 
than cache management.
The term tiered storage is unfortunate (I misused it in my original comment). In 
HDFS-2832, we consciously used the term heterogeneous storage and not 
tiered storage. Tiering, as in moving things based on their hotness, is policy. 
(BTW I envision using SSDs initially not for moving hot blocks but as storage 
for *one* of 3 replicas. I have discussed this use case with a few of the HBase 
folks). Caching is a use case that applies well to disks vs ram. Both the use 
cases apply well to the abstraction of replicas stored on different kinds of 
storage devices. 

bq. Memory is not a storage tier. It doesn't store anything; rather, it caches. 
Does it make sense to fsck memory? That is silly.
Memory and disks both store data, but one is far more durable. Fsck is a bad 
example - you run fsck on a file system, not on the disk. Here we are talking 
about entities that store HDFS block data.  But this debate over the 
similarities and differences between RAM and disk is a longer one that we 
should have over beer. I am not blind to the differences between disks and RAM. 
Further, using the same abstraction to model RAM copies and disk copies does 
not imply that I am going to always treat them as exactly the same and ignore 
the differences. 



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-18 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713117#comment-13713117
 ] 

Sanjay Radia commented on HDFS-4949:


To converge on this could we do a meetup?



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-18 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713327#comment-13713327
 ] 

Colin Patrick McCabe commented on HDFS-4949:


A meetup is a good idea.  I will be at OSCON next week on Tuesday, Wednesday, 
and Thursday, but any other time in the next two weeks is fine with me.  I 
can't speak for Andrew and Todd, but I didn't see anything on the calendar that 
would block it in that time frame.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-16 Thread Chu Tong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709911#comment-13709911
 ] 

Chu Tong commented on HDFS-4949:


To maximize memory usage, should we consider compressing file blocks before 
caching them in memory?



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-16 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710097#comment-13710097
 ] 

Colin Patrick McCabe commented on HDFS-4949:


For most of the applications we're considering here, compression would not be a 
win, because it is CPU-intensive.  It also would involve copying the data in 
memory, which is one of the things we're trying to avoid here.  I think it will 
be more effective to use something like CompressionCodec, ORC, Parquet, rcfile, 
etc.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-12 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707288#comment-13707288
 ] 

Sanjay Radia commented on HDFS-4949:


I think we can treat the RAM copies as replicas - this fits into the 
generalized tiered-storage architecture described in HDFS-2832 (RAM, flash, 
disk).
* Block reports will indicate the storage type.
* NN will store the storage type in the block map.
* Block locations returned by NN to the client will include the storage type 
(i.e. we don't need the isCached flag).
** NN will order the replica locations based on closeness and speed; this 
means that the client side will automatically go to the best place (although we 
can have a smarter client do something different if desired).
* NN will not count RAM replicas towards the normal replica count - this is one 
area where the RAM replicas are treated differently.
* This can support a usage model where the RAM replicas are at each or only 
some of the disk replica locations.
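
As a rough sketch of the proposal above (not code from HDFS or from any patch 
on this jira), the ordering and counting rules could look like the following. 
The StorageType enum, the Replica class, and the distance values are all 
hypothetical names chosen for illustration; "closeness" is assumed to be 
compared before "speed".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReplicaOrderingSketch {
    // Hypothetical storage tiers, declared fastest first so that the natural
    // enum order doubles as a "speed" order.
    enum StorageType { RAM, FLASH, DISK }

    // Hypothetical replica location as the NN might return it to a client.
    static final class Replica {
        private final String datanode;
        private final int distance; // assumed: 0 = same node, 2 = same rack, 4 = off rack
        private final StorageType type;

        Replica(String datanode, int distance, StorageType type) {
            this.datanode = datanode;
            this.distance = distance;
            this.type = type;
        }

        String getDatanode() { return datanode; }
        int getDistance() { return distance; }
        StorageType getType() { return type; }

        @Override
        public String toString() {
            return datanode + "/" + type + "@" + distance;
        }
    }

    // Order by closeness first, then by storage speed among equally close replicas.
    static List<Replica> order(List<Replica> replicas) {
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt(Replica::getDistance)
                .thenComparing(Replica::getType));
        return sorted;
    }

    // RAM replicas are excluded from the normal replica count.
    static long normalReplicaCount(List<Replica> replicas) {
        return replicas.stream()
                .filter(r -> r.getType() != StorageType.RAM)
                .count();
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
                new Replica("dn1", 4, StorageType.RAM),
                new Replica("dn2", 0, StorageType.DISK),
                new Replica("dn2", 0, StorageType.RAM),
                new Replica("dn3", 2, StorageType.DISK));
        for (Replica r : order(replicas)) {
            System.out.println(r);
        }
        System.out.println("normal replica count: " + normalReplicaCount(replicas));
    }
}
```

With this ordering a remote RAM replica sorts after a local disk replica; 
whether that is the right trade-off depends on the workload and topology.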



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707389#comment-13707389
 ] 

Colin Patrick McCabe commented on HDFS-4949:


I agree that there are some commonalities between the hierarchical storage 
management work and what we're doing here.

However, tiered storage management schemes put an entire block into a tier.  
This is different from what we want to (eventually) do with caching, which is 
to cache only part of a block.  If we end up with a heavyweight scheme where the 
entire block has to be loaded into memory before any of it can be accessed, 
this may actually cause a performance regression, not an improvement.

In general, tiered storage management happens over a longer period of time than 
cache management.  We need to be responsive to changes that happen in just a 
few seconds.  In contrast, moving things from (say) hard disk to SSD and back 
will happen over minutes or hours.  The same code is not going to be able to 
handle both well.

The proposed implementations are quite different as well.  HSM will involve 
copying block files between local FS directories.  Cache management will 
involve mlock'ing block files and passing the file descriptors to clients.  You 
might well ask, why not simply copy the block file to /dev/shm for your 
implementation?  However, this has the all-or-nothing problem described 
above (you can't cache a partial block this way).  It also has a more subtle 
problem with what we are calling revocation.  Basically, a misbehaving client 
which holds an open file descriptor in /dev/shm can continue to use memory 
indefinitely; there is no way the DataNode can ever revoke that memory.  This 
problem does not exist with the mlock solution which we have outlined here.

So while I think we should consider the possibility of sharing code as the two 
projects progress, I don't want to make this a subtask of that project.  There 
are just too many differences in goals and approaches for it to make sense.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-12 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707467#comment-13707467
 ] 

Andrew Wang commented on HDFS-4949:
---

Sanjay, thanks for your comments! I need to look more at HDFS-2832, but I think 
we've got some nice overlap. Particularly, I agree that cache would be just 
another DN Storage.

bq. Block reports will indicate the storage type.

I'm ok with this, but our initial design proposes separate heartbeats since 
cache reports might want to tick on a different interval. You might want quick 
cache reports if datanodes are doing their own LRU, but maybe you'd want to 
adaptively throttle it back if the NN is under load, since cache report 
processing can be expensive.

Separate heartbeats per-storage could definitely be added in later, so consider 
this a later-stage optimization.

bq. NN will order the replicas locations based on closeness and speed;

This is tricky because it depends on the network topology and workload. I don't 
want my single cached replica to get hammered by the entire cluster, but 
perhaps going to in-rack memory is better than local disk. I figure clients 
should be able to provide a configurable policy to their DFSClients.

I think we also still need the isCached flag for scheduling. Hypothetically, MR 
might always want to place on a memory replica over a disk replica. So, we 
could sort memory replicas first, then disk replicas. However, this squishes 
the existing ordering based on network topology used by DFSClients, and all our 
DFSClients end up hammering the cached replica at read time.

Note that even without a smarter DFSClient, we can get a lot of benefit just by 
making schedulers place tasks for memory-locality since our big win is going to 
be local memory reads. Colin's working on this in HDFS-4952.

bq. NN will not count Ram replicas towards the normal replica count - this is 
one area where the ram replicas are treated differently.
bq. This can support a usage model where the ram replicas are at each or only 
some of the disk replica locations.

+1, let's design for a future where cache might not be disk-backed. As Colin 
notes above, memory HSM is not easy, but the code should be flexible.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-12 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707536#comment-13707536
 ] 

Suresh Srinivas commented on HDFS-4949:
---

bq. However, tiered storage management schemes put an entire block into a tier. 
This is different than what we want to (eventually) do with caching, which is 
cache only part of a block. If we end up with a heavyweight scheme where the 
entire block has to be loaded into memory before any of it can be accessed, 
this may actually cause a performance regression, not an improvement.

This is one point I have been mulling over as well. I agree, partially cached 
data does not fit with the storage hierarchy. But I was not sure whether partial 
caching is being considered in the first phase of implementation in this jira 
(sorry if I missed details; I have yet to read this carefully).

One other thing that we are considering is a usage-based extra replica in the 
memory tier. I need to make some time to get all of that into a doc.



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-12 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707546#comment-13707546
 ] 

Todd Lipcon commented on HDFS-4949:
---

Hey folks. I agree that HSM is a much bigger task than what we're talking about 
here, and I'm not certain they can fit into the same framework. During our early 
internal design discussions I'd suggested the same thing, but after an hour or 
two of throwing the idea around, we discounted it due to the reasons Colin 
mentioned above (partial caching and revocation).

Though partial caching isn't referenced in the doc, it's a straightforward 
extension that we plan to tackle down the road. For example, we can take each 
block, subdivide it into 1MB chunks, and then report a bitmap indicating which 
chunks are cached. Taking advantage of the kernel lets us do this relatively 
easily by calling mlock/munlock, and the revocation problem is again simple 
because a misbehaving client won't be able to pin memory.
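
To make the bitmap idea concrete, here is a minimal sketch, assuming one bit 
per 1MB chunk of a block. The class and method names are hypothetical and are 
not taken from the design doc or from HDFS; the actual mlock/munlock calls are 
left out, and only the bookkeeping that would be reported is shown.

```java
import java.util.BitSet;

class ChunkBitmapSketch {
    static final long CHUNK_SIZE = 1L << 20; // 1MB chunks, as in the comment above

    private final long blockLength;
    private final BitSet cached = new BitSet();

    ChunkBitmapSketch(long blockLength) {
        if (blockLength <= 0) {
            throw new IllegalArgumentException("blockLength must be positive");
        }
        this.blockLength = blockLength;
    }

    // Number of chunks in the block; the last chunk may be shorter than 1MB.
    int numChunks() {
        return (int) ((blockLength + CHUNK_SIZE - 1) / CHUNK_SIZE);
    }

    // Record that a chunk has been pinned (e.g. after a successful mlock).
    void markCached(int chunk) {
        checkChunk(chunk);
        cached.set(chunk);
    }

    // Record that a chunk has been released (e.g. after munlock).
    void markEvicted(int chunk) {
        checkChunk(chunk);
        cached.clear(chunk);
    }

    // True only if every chunk overlapping [offset, offset + length) is cached.
    boolean isRangeCached(long offset, long length) {
        if (offset < 0 || length <= 0 || offset + length > blockLength) {
            throw new IllegalArgumentException("range outside block");
        }
        int first = (int) (offset / CHUNK_SIZE);
        int last = (int) ((offset + length - 1) / CHUNK_SIZE);
        return cached.nextClearBit(first) > last;
    }

    // Compact form of the bitmap that could be sent in a cache report.
    long[] report() {
        return cached.toLongArray();
    }

    private void checkChunk(int chunk) {
        if (chunk < 0 || chunk >= numChunks()) {
            throw new IndexOutOfBoundsException("chunk " + chunk);
        }
    }

    public static void main(String[] args) {
        ChunkBitmapSketch block = new ChunkBitmapSketch(128L << 20); // 128MB block
        block.markCached(0);
        block.markCached(1);
        System.out.println(block.numChunks());
        System.out.println(block.isRangeCached(0, 2L << 20));
        System.out.println(block.isRangeCached(1L << 20, 2L << 20));
    }
}
```

A 128MB block needs only 128 bits in this scheme, so the per-block reporting 
overhead stays small.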

I don't think this work precludes later work on the idea of memory-only 
storages/replicas. That has other advantages, particularly on the *write* side 
for temporary data, etc. But it is somewhat tricky to get right. When we do that, 
we should certainly look at it in a generalized way (RAM, SSD, Disk as a 
hierarchy).



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-12 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707572#comment-13707572
 ] 

Suresh Srinivas commented on HDFS-4949:
---

The problem with partial caching, though, is that while it is easy to address 
the data to be cached at the file or directory level, a more granular level 
bloats the amount of data that the cache management layer (which I would like 
to see in the namenode) must track.

How about making use of access patterns to build the cache, instead of a 
managed cache?



[jira] [Commented] (HDFS-4949) Centralized cache management in HDFS

2013-07-12 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707576#comment-13707576
 ] 

Todd Lipcon commented on HDFS-4949:
---

Yes, I think the explicit caching policies would operate mostly on a whole-file 
granularity. The DNs, though, can do local LRU tracking of sub-blocks. This is 
useful for cases like querying a Parquet file where a single file is made up of 
subranges corresponding to different columns. Some columns may be hot while 
others are cold, and if the DN can notice this and start offering zero-copy for 
the hot columns, it will be a significant performance win.
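
The local LRU tracking described above could be sketched roughly as follows. 
This is an illustration only, assuming the DN tracks fixed-size sub-blocks 
keyed by a hypothetical (blockId, chunk) pair and holds at most a fixed number 
of them; none of these names come from HDFS. It relies on the access-order 
mode of java.util.LinkedHashMap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class SubBlockLruSketch {
    // Hypothetical identifier for one sub-block (chunk) of a block.
    static final class ChunkKey {
        final long blockId;
        final int chunk;

        ChunkKey(long blockId, int chunk) {
            this.blockId = blockId;
            this.chunk = chunk;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ChunkKey)) {
                return false;
            }
            ChunkKey other = (ChunkKey) o;
            return blockId == other.blockId && chunk == other.chunk;
        }

        @Override
        public int hashCode() {
            return Objects.hash(blockId, chunk);
        }

        @Override
        public String toString() {
            return blockId + ":" + chunk;
        }
    }

    private final List<ChunkKey> evicted = new ArrayList<>();
    private final LinkedHashMap<ChunkKey, Boolean> lru;

    SubBlockLruSketch(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.lru = new LinkedHashMap<ChunkKey, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<ChunkKey, Boolean> eldest) {
                if (size() > capacity) {
                    evicted.add(eldest.getKey()); // a real DN would release the chunk here
                    return true;
                }
                return false;
            }
        };
    }

    // Record a read of one sub-block; it becomes the most recently used entry.
    void touch(long blockId, int chunk) {
        lru.put(new ChunkKey(blockId, chunk), Boolean.TRUE);
    }

    // Currently tracked sub-blocks, least recently used first.
    List<ChunkKey> tracked() {
        return new ArrayList<>(lru.keySet());
    }

    // Sub-blocks dropped so far, in eviction order.
    List<ChunkKey> evicted() {
        return new ArrayList<>(evicted);
    }

    public static void main(String[] args) {
        SubBlockLruSketch lru = new SubBlockLruSketch(2);
        lru.touch(1L, 0); // hot column range
        lru.touch(1L, 1); // cold column range
        lru.touch(1L, 0); // hot range read again
        lru.touch(1L, 2); // new range pushes out the cold one
        System.out.println("tracked: " + lru.tracked());
        System.out.println("evicted: " + lru.evicted());
    }
}
```

In the Parquet case, the chunks holding frequently read columns would keep 
being touched and stay tracked, while chunks for cold columns age out.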
