[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822440#comment-13822440
 ] 

Hudson commented on HDFS-5366:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1582 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1582/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/CacheReplicationMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetCache.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/MappableBlock.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestPathBasedCacheRequests.java


> recaching improvements
> --
>
> Key: HDFS-5366
> URL: https://issues.apache.org/jira/browse/HDFS-5366
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
> Attachments: HDFS-5366-caching.001.patch, HDFS-5366.002.patch, 
> HDFS-5366.005.patch, HDFS-5366.007.patch
>
>
> There are a few things about our HDFS-4949 recaching strategy that could be 
> improved.
> * We should monitor the DN's maximum and current mlock'ed memory consumption 
> levels, so that we don't ask the DN to do stuff it can't.
> * We should not try to initiate caching on stale or decommissioning DataNodes 
> (although we should not recache things stored on such nodes until they're 
> declared dead).
> * We might want to resend the {{DNA_CACHE}} or {{DNA_UNCACHE}} command a few 
> times before giving up.  Currently, we only send it once.
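
The first and third proposals above can be illustrated with a minimal Java sketch. This is not the HDFS-5366 patch: `RecachingSketch`, its nested `PendingCommand`, and the parameters `maxAttempts` and `retryIntervalMs` are hypothetical names chosen for illustration. The sketch assumes the NameNode knows each DataNode's cache capacity and current usage in bytes, and that the timestamps passed in come from a monotonic clock.

```java
/** Hypothetical sketch; none of these names come from the HDFS source. */
class RecachingSketch {
    /** True if a block of blockBytes fits in the DataNode's remaining cache. */
    static boolean canCache(long capacityBytes, long usedBytes, long blockBytes) {
        return blockBytes <= capacityBytes - usedBytes;
    }

    /** Tracks resend attempts for one cache or uncache command. */
    static class PendingCommand {
        private final int maxAttempts;
        private final long retryIntervalMs;
        private int attempts = 0;
        private long lastSentMs = 0;
        private boolean acked = false;

        PendingCommand(int maxAttempts, long retryIntervalMs) {
            this.maxAttempts = maxAttempts;
            this.retryIntervalMs = retryIntervalMs;
        }

        /** True if the command should be sent (or resent) at time nowMs. */
        boolean shouldSend(long nowMs) {
            if (acked || attempts >= maxAttempts) {
                return false;
            }
            // Send immediately the first time, then wait out the retry interval.
            return attempts == 0 || nowMs - lastSentMs >= retryIntervalMs;
        }

        void markSent(long nowMs) {
            attempts++;
            lastSentMs = nowMs;
        }

        void markAcked() {
            acked = true;
        }

        /** True once all attempts are used up without an acknowledgement. */
        boolean gaveUp() {
            return !acked && attempts >= maxAttempts;
        }
    }
}
```

A caller would check `canCache` before choosing a DataNode, then poll `shouldSend` on each heartbeat until the command is acknowledged or `gaveUp` returns true.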



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822415#comment-13822415
 ] 

Hudson commented on HDFS-5366:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1608 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1608/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)




[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822324#comment-13822324
 ] 

Hudson commented on HDFS-5366:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #391 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/391/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)




[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821644#comment-13821644
 ] 

Hudson commented on HDFS-5366:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4725 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4725/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)




[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820965#comment-13820965
 ] 

Hadoop QA commented on HDFS-5366:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12613501/HDFS-5366.007.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5415//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5415//console

This message is automatically generated.



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820961#comment-13820961
 ] 

Chris Nauroth commented on HDFS-5366:
-

+1 for the new patch.  (I took another look at the latest version since a few 
things had changed.)



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820877#comment-13820877
 ] 

Hadoop QA commented on HDFS-5366:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12613472/HDFS-5366.006.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5409//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5409//console

This message is automatically generated.



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820835#comment-13820835
 ] 

Colin Patrick McCabe commented on HDFS-5366:


bq. The mlock stubbing will need to be rebased a bit for HDFS-5450 which was 
just committed, since it was still stubbing with MappableBlock#Mlocker.

rebased in version 7

bq. There was also a fix to TestFsDatasetCache to reset the Mlocker stub to the 
default after it's set, will want the same thing here.

yeah.

bq. We probably want to stub mlock for TestPathBasedCacheRequests too, but I 
feel like we should leave at least one test calling mlock. Maybe in 
TestNativeIO?

Yeah, TestNativeIO still tests "raw" mlock.



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820755#comment-13820755
 ] 

Andrew Wang commented on HDFS-5366:
---

+1, looks basically good. Just a few notes:

* The mlock stubbing will need to be rebased a bit for HDFS-5450 which was just 
committed, since it was still stubbing with MappableBlock#Mlocker.
* There was also a fix to TestFsDatasetCache to reset the Mlocker stub to the 
default after it's set, will want the same thing here.
* We probably want to stub mlock for TestPathBasedCacheRequests too, but I feel 
like we should leave at least one test calling mlock. Maybe in TestNativeIO?
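
The second note, resetting the stub to the default after a test sets it, is usually done with a try/finally. The following is only a sketch under assumptions: `MlockStubSketch` and its `Mlocker` interface are hypothetical stand-ins written for this example, not the Hadoop classes discussed above.

```java
/** Hypothetical stand-in for a pluggable mlock hook; not Hadoop's class. */
class MlockStubSketch {
    interface Mlocker {
        boolean lock(long bytes);
    }

    /** Default hook; the real native call is out of scope for this sketch. */
    static final Mlocker DEFAULT = bytes -> {
        throw new UnsupportedOperationException("native mlock not modelled here");
    };

    /** The hook currently in effect. Tests swap it temporarily. */
    static Mlocker current = DEFAULT;

    /** Installs stub, runs one lock call, and always restores the old hook. */
    static boolean runWithStub(Mlocker stub, long bytes) {
        Mlocker previous = current;
        current = stub;
        try {
            return current.lock(bytes);
        } finally {
            // Restore even if the stubbed call throws, so later tests
            // do not inherit the stub.
            current = previous;
        }
    }
}
```

Restoring in `finally` keeps a failing test from leaking its stub into whichever test runs next.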



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820710#comment-13820710
 ] 

Colin Patrick McCabe commented on HDFS-5366:


originally I was going to tackle stale / decommissioning / full nodes here, but 
in the interest of keeping patches from getting too big, let's do it in 
HDFS-5507.



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820649#comment-13820649
 ] 

Hadoop QA commented on HDFS-5366:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12613454/HDFS-5366.005.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5405//console

This message is automatically generated.



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820631#comment-13820631
 ] 

Andrew Wang commented on HDFS-5366:
---

Sounds good, but it seems like {{shouldSendCachingCommands}} and 
{{sendCachingCommands}} are still awful similar...



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820361#comment-13820361
 ] 

Colin Patrick McCabe commented on HDFS-5366:


bq. Since we already have a config key named 
"dfs.namenode.path.based.cache.refresh.interval.ms", can we call this one 
"dfs.namenode.path.based.cache.retry.interval.ms"?

yeah, I like that name better

bq. New configs should go in hdfs-default.xml too

ok

bq. Nit: extra newline in DatanodeManager#getCacheCommand

ok

bq. Javadoc on DatanodeDescriptor methods saying whether they take wallclock or 
monotonic time

ok

bq. Any reason to prefer the iterator-based removal over using clear? If it's 
not necessary, we could not do this to keep the diff small.

This way, we only have to iterate over it once, not twice.

bq. Extra imports in CacheReplicationMonitor, DatanodeDescriptor

ok

bq. In DatanodeManager, having variables named sendingCachingCommands and 
sendCachingCommands is confusing, rename to retryCachingCommands or something?

renamed to {{shouldSendCachingCommands}}
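
For readers unfamiliar with the iterator-based removal mentioned above, here is a minimal Java sketch of the idea. `DrainSketch` and `drain` are hypothetical names, and the sketch assumes a collection whose `Iterator.remove` is cheap (for example a linked list); it does not reproduce the patch's code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of draining a pending list in a single pass. */
class DrainSketch {
    /**
     * Copies every element of pending into a new list and removes each
     * element as it is visited, so no separate clear() pass is needed.
     */
    static List<Long> drain(List<Long> pending) {
        List<Long> drained = new ArrayList<>();
        for (Iterator<Long> it = pending.iterator(); it.hasNext(); ) {
            drained.add(it.next());
            it.remove(); // removes the element just returned by next()
        }
        return drained;
    }
}
```

The alternative is to copy the elements in one loop and then call `clear()` on the source, which touches the collection a second time.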



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-11 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819860#comment-13819860
 ] 

Chris Nauroth commented on HDFS-5366:
-

+1 for the patch after resolving Andrew's feedback.  Thanks for incorporating 
the fix discussed earlier.



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-11 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819708#comment-13819708
 ] 

Andrew Wang commented on HDFS-5366:
---

Hey Colin, thanks for the patch. Basically only nitty review stuff here:

* Since we already have a config key named 
{{"dfs.namenode.path.based.cache.refresh.interval.ms"}}, can we call this one 
{{"dfs.namenode.path.based.cache.retry.interval.ms"}}?
* New configs should go in hdfs-default.xml too
* Nit: extra newline in DatanodeManager#getCacheCommand
* Javadoc on DatanodeDescriptor methods saying whether they take wallclock or 
monotonic time
* Any reason to prefer the iterator-based removal over using clear? If it's not 
necessary, we could not do this to keep the diff small.
* Extra imports in CacheReplicationMonitor, DatanodeDescriptor
* In DatanodeManager, having variables named {{sendingCachingCommands}} and 
{{sendCachingCommands}} is confusing, rename to {{retryCachingCommands}} or 
something?
* Testing, it looks like the only changes are some additional prints. Can we 
get a test that actually verifies that the NN will resend cache/uncache 
commands? I bet you can intercept the commands with a spy or something so they 
don't reach the DN.
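
On the wallclock versus monotonic point: a retry interval is safest when measured on a monotonic clock, because wall-clock time can jump when the system clock is adjusted. A small sketch using only the JDK follows; `MonotonicClockSketch` and its method names are hypothetical, not the methods in DatanodeDescriptor.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch contrasting the two kinds of timestamps. */
class MonotonicClockSketch {
    /** Monotonic milliseconds; only differences between readings are meaningful. */
    static long monotonicNowMs() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    /** Wall-clock milliseconds since the epoch; may jump if the clock is reset. */
    static long wallClockNowMs() {
        return System.currentTimeMillis();
    }

    /** True if at least intervalMs has passed; both timestamps must be monotonic. */
    static boolean intervalElapsed(long lastSentMonotonicMs,
                                   long nowMonotonicMs,
                                   long intervalMs) {
        return nowMonotonicMs - lastSentMonotonicMs >= intervalMs;
    }
}
```

Documenting which kind of timestamp a method expects, as the review asks, prevents callers from mixing the two and computing meaningless differences.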

> recaching improvements
> --
>
> Key: HDFS-5366
> URL: https://issues.apache.org/jira/browse/HDFS-5366
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Affects Versions: 3.0.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5366-caching.001.patch, HDFS-5366.002.patch
>
>
> There are a few things about our HDFS-4949 recaching strategy that could be 
> improved.
> * We should monitor the DN's maximum and current mlock'ed memory consumption 
> levels, so that we don't ask the DN to do stuff it can't.
> * We should not try to initiate caching on stale or decommissioning DataNodes 
> (although we should not recache things stored on such nodes until they're 
> declared dead).
> * We might want to resend the {{DNA_CACHE}} or {{DNA_UNCACHE}} command a few 
> times before giving up.  Currently, we only send it once.





[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-08 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817838#comment-13817838
 ] 

Colin Patrick McCabe commented on HDFS-5366:


The eclipse:eclipse target failure doesn't have anything to do with this patch. 
 That failure is also what's causing the bogus release audit warning:

{code}
 !? hs_err_pid4577.log
Lines that start with ? in the release audit report indicate files that do 
not have an Apache license header.
{code}



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817737#comment-13817737
 ] 

Hadoop QA commented on HDFS-5366:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12612870/HDFS-5366.002.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5362//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5362//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5362//console

This message is automatically generated.



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-08 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817613#comment-13817613
 ] 

Colin Patrick McCabe commented on HDFS-5366:


Here's a new patch incorporating Chris' fix.

The overall idea here is to keep lists of replicas to cache/uncache around 
until the DN replies and says that they've been acted on.  This is different 
than the current scheme, where they are "fire and forget."

To prevent re-sending these commands too often, this introduces a per-DN timer 
which sets the maximum rate at which commands can be re-sent.  (This timer can 
be overridden by the cache rescanner thread changing what should be cached, 
though.)
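
A minimal sketch of the scheme described above, not the code in the patch: pending block IDs stay on a per-DN list until the DN reports them cached, and a per-DN timer limits how often they are re-sent. The class name {{PendingCacheSketch}}, its methods, and the 30-second interval are hypothetical. Modelling the rescanner override as a "changed since last send" flag is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical per-DataNode bookkeeping; not the real DatanodeDescriptor. */
final class PendingCacheSketch {
  private final Set<Long> pendingCached = new LinkedHashSet<>();
  private final long resendIntervalMs;
  private long lastSentMs;
  private boolean sentOnce = false;
  private boolean changedSinceLastSend = false;

  PendingCacheSketch(long resendIntervalMs) {
    this.resendIntervalMs = resendIntervalMs;
  }

  /** Called by the (assumed) rescanner when a block should be cached on this DN. */
  void schedule(long blockId) {
    if (pendingCached.add(blockId)) {
      changedSinceLastSend = true;  // lets the next heartbeat bypass the timer
    }
  }

  /** Called when the DN reports that the block is now cached. */
  void acknowledge(long blockId) {
    pendingCached.remove(blockId);
  }

  /** Called on each heartbeat; returns the block IDs to (re)send, possibly none. */
  List<Long> onHeartbeat(long nowMs) {
    boolean timerExpired = !sentOnce || nowMs - lastSentMs >= resendIntervalMs;
    if (pendingCached.isEmpty() || !(timerExpired || changedSinceLastSend)) {
      return new ArrayList<>();
    }
    lastSentMs = nowMs;
    sentOnce = true;
    changedSinceLastSend = false;
    // Entries are NOT removed here; they stay pending until acknowledged.
    return new ArrayList<>(pendingCached);
  }

  public static void main(String[] args) {
    PendingCacheSketch dn = new PendingCacheSketch(30_000);
    dn.schedule(1L);
    System.out.println(dn.onHeartbeat(0));       // first send
    System.out.println(dn.onHeartbeat(3_000));   // suppressed by the timer
    System.out.println(dn.onHeartbeat(30_000));  // re-sent, still unacknowledged
    dn.acknowledge(1L);
    System.out.println(dn.onHeartbeat(60_000));  // nothing left to send
  }
}
```

The point of the sketch is the contrast with "fire and forget": sending a command does not drop it from the list; only the DN's acknowledgement does.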



[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-07 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816843#comment-13816843
 ] 

Colin Patrick McCabe commented on HDFS-5366:


Good find, Chris.  We definitely should update 
{{lastCachingDirectiveSentTimeMs}} just once in that function.  As you said, 
I'm waiting for HDFS-5394 to land before rebasing this.  I kicked the Jenkins 
build, but it's still pending.

> recaching improvements
> --
>
> Key: HDFS-5366
> URL: https://issues.apache.org/jira/browse/HDFS-5366
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Affects Versions: HDFS-4949
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5366-caching.001.patch
>
>
> There are a few things about our HDFS-4949 recaching strategy that could be 
> improved.
> * We should monitor the DN's maximum and current mlock'ed memory consumption 
> levels, so that we don't ask the DN to do stuff it can't.
> * We should not try to initiate caching on stale or decommissioning DataNodes 
> (although we should not recache things stored on such nodes until they're 
> declared dead).
> * We might want to resend the {{DNA_CACHE}} or {{DNA_UNCACHE}} command a few 
> times before giving up.  Currently, we only send it once.





[jira] [Commented] (HDFS-5366) recaching improvements

2013-11-07 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816802#comment-13816802
 ] 

Chris Nauroth commented on HDFS-5366:
-

I tested this patch and found that blocks were never being uncached.  The 
NameNode never sent DNA_UNCACHE messages to the DataNode.  The reason is that there are 
separate calls to {{DatanodeManager#getCacheCommand}} to get the DNA_CACHE set 
followed by the DNA_UNCACHE set.  The method internally resets the last message 
time for the DataNode.  This means that when it's time to send messages, the 
first call for the DNA_CACHE messages succeeds and resets the clock for that 
DataNode to right now.  Then, the second call for the DNA_UNCACHE messages 
always returns null, because it looks like it's not time to send messages.

To solve this, we need to set the DataNode's last caching directive sent time 
just once, after calculating both the DNA_CACHE and DNA_UNCACHE commands.  I 
changed the code as follows to do this.  Feel free to incorporate it into the 
next patch.  (I'm not uploading a new patch right now, because I don't want to 
detangle it out of the HDFS-5394 patch applied in my environment.)

In {{DatanodeManager#handleHeartbeat}}:

{code}
long monoTimeMs = Time.monotonicNow();
if (sendCachingCommands) {
  // Check the resend timer once for both command types.
  if ((monoTimeMs - nodeinfo.getLastCachingDirectiveSentTimeMs()) >=
      timeBetweenResendingCachingDirectivesMs) {
    DatanodeCommand pendingCacheCommand = getCacheCommand(
        nodeinfo.getPendingCached(), nodeinfo,
        DatanodeProtocol.DNA_CACHE, blockPoolId);
    if (pendingCacheCommand != null) {
      cmds.add(pendingCacheCommand);
    }
    DatanodeCommand pendingUncacheCommand = getCacheCommand(
        nodeinfo.getPendingUncached(), nodeinfo,
        DatanodeProtocol.DNA_UNCACHE, blockPoolId);
    if (pendingUncacheCommand != null) {
      cmds.add(pendingUncacheCommand);
    }
    // Reset the timer only after both commands have been built.
    nodeinfo.setLastCachingDirectiveSentTimeMs(monoTimeMs);
  }
}
{code}

And {{DatanodeManager#getCacheCommand}}:

{code}
  /**
   * Convert a CachedBlocksList into a DatanodeCommand with a list of blocks.
   *
   * @param list       The {@link CachedBlocksList}.  This function
   *                   clears the list.
   * @param datanode   The datanode.
   * @param action     The action to perform in the command.
   * @param poolId     The block pool id.
   * @return           A DatanodeCommand to be sent back to the DN, or null if
   *                   there is nothing to be done.
   */
  private DatanodeCommand getCacheCommand(CachedBlocksList list,
      DatanodeDescriptor datanode, int action, String poolId) {
    int length = list.size();
    if (length == 0) {
      return null;
    }
    // Read and clear the existing cache commands.
    long[] blockIds = new long[length];
    int i = 0;
    for (Iterator<CachedBlock> iter = list.iterator();
        iter.hasNext(); ) {
      CachedBlock cachedBlock = iter.next();
      blockIds[i++] = cachedBlock.getBlockId();
      iter.remove();
    }
    return new BlockIdCommand(action, poolId, blockIds);
  }
{code}

I re-tested with these changes, and it worked.
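
For anyone who wants to see the failure mode in isolation, here is a stripped-down sketch of the two shapes, with no Hadoop classes involved. {{ResendClockSketch}}, its method names, and the 30-second interval are all hypothetical; commands are plain strings standing in for the real {{DatanodeCommand}} objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one DataNode's resend clock; not Hadoop code. */
final class ResendClockSketch {
  static final long INTERVAL_MS = 30_000;  // assumed resend interval
  long lastSentMs = -INTERVAL_MS;          // so the first heartbeat may send

  /** Buggy shape: every call checks the clock AND resets it. */
  String buggyGet(String action, long nowMs) {
    if (nowMs - lastSentMs < INTERVAL_MS) {
      return null;
    }
    lastSentMs = nowMs;
    return action;
  }

  List<String> buggyHeartbeat(long nowMs) {
    List<String> cmds = new ArrayList<>();
    String cache = buggyGet("DNA_CACHE", nowMs);
    if (cache != null) {
      cmds.add(cache);
    }
    // The call above just reset the clock, so this one always sees "too soon".
    String uncache = buggyGet("DNA_UNCACHE", nowMs);
    if (uncache != null) {
      cmds.add(uncache);
    }
    return cmds;
  }

  /** Fixed shape: check once, build both commands, then reset once. */
  List<String> fixedHeartbeat(long nowMs) {
    List<String> cmds = new ArrayList<>();
    if (nowMs - lastSentMs >= INTERVAL_MS) {
      cmds.add("DNA_CACHE");
      cmds.add("DNA_UNCACHE");
      lastSentMs = nowMs;
    }
    return cmds;
  }

  public static void main(String[] args) {
    System.out.println(new ResendClockSketch().buggyHeartbeat(0));
    System.out.println(new ResendClockSketch().fixedHeartbeat(0));
  }
}
```

In the buggy shape the uncache command is starved on every heartbeat, not just the first one, because the cache call always wins the race for the clock.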




[jira] [Commented] (HDFS-5366) recaching improvements

2013-10-16 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797468#comment-13797468
 ] 

Colin Patrick McCabe commented on HDFS-5366:


The thing about block replication is, if you lose all copies of the block, you 
have a problem.  For us, if we lose all cache replicas, it's not a big deal.  
It's not obvious that a block which has 2 cached replicas out of 3 requested 
should be given lower priority than one with 0 out of 3.  Maybe the 2/3 block 
is just that much more important.  It will depend on which pools the 
requests came from.  I guess we'll have to do that as part of the effort to do 
pool quotas.

> recaching improvements
> --
>
> Key: HDFS-5366
> URL: https://issues.apache.org/jira/browse/HDFS-5366
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Affects Versions: HDFS-4949
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
>
> There are a few things about our HDFS-4949 recaching strategy that could be 
> improved.
> * We should monitor the DN's maximum and current mlock'ed memory consumption 
> levels, so that we don't ask the DN to do stuff it can't.
> * We should not try to initiate caching on stale DataNodes (although we 
> should not recache things stored on such nodes until they're declared dead).
> * We might want to resend the {{DNA_CACHE}} or {{DNA_UNCACHE}} command a few 
> times before giving up.  Currently, we only send it once.





[jira] [Commented] (HDFS-5366) recaching improvements

2013-10-16 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797325#comment-13797325
 ] 

Colin Patrick McCabe commented on HDFS-5366:


As Andrew pointed out on HDFS-5096, we should also kick the CRMon on a DN 
failure.  We should also avoid scheduling new work on decommissioning nodes (as 
well as stale nodes).



[jira] [Commented] (HDFS-5366) recaching improvements

2013-10-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797319#comment-13797319
 ] 

Andrew Wang commented on HDFS-5366:
---

One interesting idea from the block replication code is having priorities for 
replication work based on the current and expected replication factor. Maybe a 
"0 of 3" case should be rescheduled elsewhere more quickly than the 10.5 minute 
dead datanode interval, while we let a mild case of "2 of 3" sit.

I don't think this will require tracking our own list of "stale" or "dead" 
nodes, just a list of nodes we've already tried for an outstanding request. We 
reset if we've tried all targets. I seem to remember the block recovery code or 
something doing this. Avoiding stale nodes might also be good enough, if we 
think that heartbeats are a good proxy for the DN's ability to cache/uncache. 
This probably isn't true for uncaching though, since as you've noted, a hung 
client could just hold onto a ZCR lease.
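
A rough sketch of the "nodes we've already tried" idea, purely as an illustration and not taken from the block recovery code: keep a per-request set of attempted targets, skip those when choosing the next one, and start over once every candidate has been tried. {{TriedTargetsSketch}} and {{nextTarget}} are hypothetical names, and DataNodes are represented by plain strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical per-request record of targets already attempted. */
final class TriedTargetsSketch {
  private final Set<String> tried = new HashSet<>();

  /**
   * Returns the first candidate that has not been tried yet.  Once all
   * candidates have been tried, the history is cleared and selection
   * starts over.  Returns null only when there are no candidates.
   */
  String nextTarget(List<String> candidates) {
    if (candidates.isEmpty()) {
      return null;
    }
    if (tried.containsAll(candidates)) {
      tried.clear();  // every target was attempted: reset
    }
    for (String dn : candidates) {
      if (tried.add(dn)) {  // add() is true only for an untried node
        return dn;
      }
    }
    return null;  // not reached: at least one candidate is untried here
  }

  public static void main(String[] args) {
    TriedTargetsSketch request = new TriedTargetsSketch();
    List<String> candidates = Arrays.asList("dn1", "dn2");
    for (int attempt = 0; attempt < 3; attempt++) {
      System.out.println(request.nextTarget(candidates));
    }
  }
}
```

This keeps the state local to each outstanding request, so it does not need a separate stale or dead node list.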



[jira] [Commented] (HDFS-5366) recaching improvements

2013-10-15 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796322#comment-13796322
 ] 

Colin Patrick McCabe commented on HDFS-5366:


The other question that came up in discussion on HDFS-5096 is whether we should 
have a dedicated thread (independent of the {{CacheReplicationMonitor}} thread) 
which periodically re-examines the outstanding cache and uncache requests, and 
reschedules them to a different node if they aren't fulfilled.  I've thought 
about this, but I'm not sure that we need it.

The problem is that both caching and uncaching take time.  Caching takes time 
because it involves reading from disk.  Uncaching takes time because a client 
might have an mmap that needs to be revoked.  The involuntary revocation period 
will be at least 5 minutes, to avoid having clients burned by GCs.

If we're too aggressive about rescheduling our cache/uncache operations, we may 
create a lot of churn.  If the period of such a "rescheduler thread" would be 
measured in minutes, isn't it simpler to just use the rescanning thread to 
handle this scenario?

The other problem is that we currently rely on the {{DatanodeManager}} to tell 
us when a node is bad.  Its timeouts are generous (10.5 minutes by default to 
declare a node dead), so the proposed "rescheduler" would either have to 
maintain its own list of who is naughty and nice, or have a really long period 
(again overlapping with the rescanner thread).  I don't really want to 
duplicate the deadNodes list...

I do think we should resend the DNA_CACHE, etc. as I mentioned above.  Networks 
do lose messages, after all.  But we might have to assume that if a DN tells us 
it can cache X bytes, that it's telling the truth.  Otherwise, the failure 
cases we have to think about tend to proliferate.
