[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822440#comment-13822440 ]

Hudson commented on HDFS-5366:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1582 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1582/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/CacheReplicationMonitor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetCache.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/MappableBlock.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestPathBasedCacheRequests.java

> recaching improvements
> --
>
> Key: HDFS-5366
> URL: https://issues.apache.org/jira/browse/HDFS-5366
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Affects Versions: 3.0.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-5366-caching.001.patch, HDFS-5366.002.patch, HDFS-5366.005.patch, HDFS-5366.007.patch
>
> There are a few things about our HDFS-4949 recaching strategy that could be improved.
> * We should monitor the DN's maximum and current mlock'ed memory consumption levels, so that we don't ask the DN to do stuff it can't.
> * We should not try to initiate caching on stale or decommissioning DataNodes (although we should not recache things stored on such nodes until they're declared dead).
> * We might want to resend the {{DNA_CACHE}} or {{DNA_UNCACHE}} command a few times before giving up. Currently, we only send it once.

--
This message was sent by Atlassian JIRA (v6.1#6144)
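The third bullet of the description — resend {{DNA_CACHE}}/{{DNA_UNCACHE}} a few times before giving up — can be sketched as a small bounded-retry tracker on the NameNode side. This is an illustrative sketch only, not the actual HDFS-5366 patch; the class and method names ({{CacheCommandRetryTracker}}, {{shouldSend}}, {{markComplete}}) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: track how many times a caching command has been
// sent per block, and stop resending after a bounded number of attempts.
class CacheCommandRetryTracker {
    private final int maxRetries;
    // blockId -> number of DNA_CACHE/DNA_UNCACHE commands already sent
    private final Map<Long, Integer> attempts = new HashMap<>();

    CacheCommandRetryTracker(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Returns true if the command should be (re)sent for this block. */
    boolean shouldSend(long blockId) {
        int sent = attempts.getOrDefault(blockId, 0);
        if (sent >= maxRetries) {
            return false; // give up after maxRetries sends
        }
        attempts.put(blockId, sent + 1);
        return true;
    }

    /** Called when the DataNode reports the block as (un)cached. */
    void markComplete(long blockId) {
        attempts.remove(blockId);
    }
}
```

A periodic monitor (in HDFS, the CacheReplicationMonitor rescans) would consult such a tracker on each pass rather than sending the command exactly once.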
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822415#comment-13822415 ]

Hudson commented on HDFS-5366:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1608 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1608/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822324#comment-13822324 ]

Hudson commented on HDFS-5366:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #391 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/391/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821644#comment-13821644 ]

Hudson commented on HDFS-5366:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4725 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4725/])
HDFS-5366. recaching improvements (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1541647)
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820965#comment-13820965 ]

Hadoop QA commented on HDFS-5366:
--

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12613501/HDFS-5366.007.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5415//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5415//console

This message is automatically generated.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820961#comment-13820961 ]

Chris Nauroth commented on HDFS-5366:
--

+1 for the new patch. (I took another look at the latest version, since a few things had changed.)
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820877#comment-13820877 ]

Hadoop QA commented on HDFS-5366:
--

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12613472/HDFS-5366.006.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5409//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5409//console

This message is automatically generated.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820835#comment-13820835 ]

Colin Patrick McCabe commented on HDFS-5366:
--

bq. The mlock stubbing will need to be rebased a bit for HDFS-5450, which was just committed, since it was still stubbing with MappableBlock#Mlocker.

Rebased in version 7.

bq. There was also a fix to TestFsDatasetCache to reset the Mlocker stub to the default after it's set; we'll want the same thing here.

Yeah.

bq. We probably want to stub mlock for TestPathBasedCacheRequests too, but I feel like we should leave at least one test calling mlock. Maybe in TestNativeIO?

Yeah, TestNativeIO still tests "raw" mlock.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820755#comment-13820755 ]

Andrew Wang commented on HDFS-5366:
--

+1, looks basically good. Just a few notes:
* The mlock stubbing will need to be rebased a bit for HDFS-5450, which was just committed, since it was still stubbing with MappableBlock#Mlocker.
* There was also a fix to TestFsDatasetCache to reset the Mlocker stub to the default after it's set; we'll want the same thing here.
* We probably want to stub mlock for TestPathBasedCacheRequests too, but I feel like we should leave at least one test calling mlock. Maybe in TestNativeIO?
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820710#comment-13820710 ]

Colin Patrick McCabe commented on HDFS-5366:
--

Originally I was going to tackle stale / decommissioning / full nodes here, but in the interest of keeping patches from getting too big, let's do it in HDFS-5507.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820649#comment-13820649 ]

Hadoop QA commented on HDFS-5366:
--

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12613454/HDFS-5366.005.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:red}-1 javac{color}. The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5405//console

This message is automatically generated.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820631#comment-13820631 ]

Andrew Wang commented on HDFS-5366:
--

Sounds good, but it seems like {{shouldSendCachingCommands}} and {{sendCachingCommands}} are still awfully similar...
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820361#comment-13820361 ]

Colin Patrick McCabe commented on HDFS-5366:
--

bq. Since we already have a config key named "dfs.namenode.path.based.cache.refresh.interval.ms", can we call this one "dfs.namenode.path.based.cache.retry.interval.ms"?

Yeah, I like that name better.

bq. New configs should go in hdfs-default.xml too.

OK.

bq. Nit: extra newline in DatanodeManager#getCacheCommand.

OK.

bq. Javadoc on DatanodeDescriptor methods saying whether they take wallclock or monotonic time.

OK.

bq. Any reason to prefer the iterator-based removal over using clear? If it's not necessary, we could skip this to keep the diff small.

This way, we only have to iterate over it once, not twice.

bq. Extra imports in CacheReplicationMonitor, DatanodeDescriptor.

OK.

bq. In DatanodeManager, having variables named sendingCachingCommands and sendCachingCommands is confusing; rename to retryCachingCommands or something?

Renamed to {{shouldSendCachingCommands}}.
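For context on the key-naming discussion above, a new NameNode config key of this kind is documented in hdfs-default.xml with a {{property}} entry like the following. This is an illustrative sketch only; the default value (30000 ms) and description text are assumptions, not quoted from the committed patch.

```xml
<property>
  <name>dfs.namenode.path.based.cache.retry.interval.ms</name>
  <value>30000</value>
  <description>
    Illustrative entry: how long the NameNode waits before resending a
    DNA_CACHE or DNA_UNCACHE command that a DataNode has not yet acted on.
  </description>
</property>
```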
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819860#comment-13819860 ]

Chris Nauroth commented on HDFS-5366:
--

+1 for the patch after resolving Andrew's feedback. Thanks for incorporating the fix discussed earlier.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819708#comment-13819708 ]

Andrew Wang commented on HDFS-5366:
--

Hey Colin, thanks for the patch. Basically only nitty review stuff here:
* Since we already have a config key named {{"dfs.namenode.path.based.cache.refresh.interval.ms"}}, can we call this one {{"dfs.namenode.path.based.cache.retry.interval.ms"}}?
* New configs should go in hdfs-default.xml too.
* Nit: extra newline in DatanodeManager#getCacheCommand.
* Javadoc on DatanodeDescriptor methods saying whether they take wallclock or monotonic time.
* Any reason to prefer the iterator-based removal over using clear? If it's not necessary, we could skip this to keep the diff small.
* Extra imports in CacheReplicationMonitor, DatanodeDescriptor.
* In DatanodeManager, having variables named {{sendingCachingCommands}} and {{sendCachingCommands}} is confusing; rename to {{retryCachingCommands}} or something?
* Testing: it looks like the only changes are some additional prints. Can we get a test that actually verifies that the NN will resend cache/uncache commands? I bet you can intercept the commands with a spy or something so they don't reach the DN.
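The last bullet above suggests intercepting NN-to-DN commands with a spy so they never reach the DataNode (in real Hadoop tests this would typically be done with Mockito). The core idea can be shown in plain Java: wrap the command channel in a recorder that a test can assert against. The names here ({{CommandChannel}}, {{RecordingChannel}}) are hypothetical, not actual HDFS classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the channel the NN uses to send DN commands.
interface CommandChannel {
    void send(String command); // e.g. "DNA_CACHE" / "DNA_UNCACHE"
}

// A recording wrapper ("spy"): captures every command, and optionally
// swallows it so it never reaches the real DataNode. A test can then
// assert that the NN resent a command the expected number of times.
class RecordingChannel implements CommandChannel {
    final List<String> recorded = new ArrayList<>();
    private final CommandChannel delegate; // may be null when swallowing
    private final boolean deliver;

    RecordingChannel(CommandChannel delegate, boolean deliver) {
        this.delegate = delegate;
        this.deliver = deliver;
    }

    @Override
    public void send(String command) {
        recorded.add(command);
        if (deliver && delegate != null) {
            delegate.send(command); // pass through when requested
        }
    }
}
```

With {{deliver = false}}, the DN never acts on the command, so a correct NN retry loop should keep resending until its retry limit, which is exactly what the requested test would verify.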
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817838#comment-13817838 ]

Colin Patrick McCabe commented on HDFS-5366:
--

The eclipse:eclipse target failure doesn't have anything to do with this patch. This is also causing the bogus release audit warning:

{code}
!? hs_err_pid4577.log
Lines that start with ? in the release audit report indicate files that do not have an Apache license header.
{code}
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817737#comment-13817737 ] Hadoop QA commented on HDFS-5366:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12612870/HDFS-5366.002.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5362//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/5362//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5362//console
This message is automatically generated.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817613#comment-13817613 ] Colin Patrick McCabe commented on HDFS-5366: Here's a new patch incorporating Chris' fix. The overall idea here is to keep lists of replicas to cache/uncache around until the DN replies and says that they've been acted on. This is different from the current scheme, where they are "fire and forget." To prevent re-sending these commands too often, this introduces a per-DN timer which sets the maximum rate at which commands can be re-sent. (This timer can be overridden by the cache rescanner thread changing what should be cached, though.)
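The per-DN retry timer described above can be sketched in a few lines. The names and the interval here are assumptions for illustration; the real patch wires the interval through DFSConfigKeys and stores the timestamp on DatanodeDescriptor:

```java
// Minimal sketch (assumed names) of the per-DN resend timer: caching
// directives may be re-sent only after a configured interval has elapsed
// on a monotonic clock, unless the cache rescanner overrides the timer.
public class ResendTimerSketch {
  static final long RETRY_INTERVAL_MS = 30_000L; // illustrative default

  static class DatanodeState {
    // Far in the past, so the first send is never throttled.
    long lastCachingDirectiveSentTimeMs = Long.MIN_VALUE / 2;
  }

  // Returns true if it is time to (re)send caching directives to this DN.
  static boolean shouldSend(DatanodeState dn, long monoNowMs,
      boolean rescanOverride) {
    return rescanOverride
        || (monoNowMs - dn.lastCachingDirectiveSentTimeMs) >= RETRY_INTERVAL_MS;
  }

  public static void main(String[] args) {
    DatanodeState dn = new DatanodeState();
    long t0 = 0L;
    check(shouldSend(dn, t0, false));                    // never sent before
    dn.lastCachingDirectiveSentTimeMs = t0;              // record the send
    check(!shouldSend(dn, t0 + 1_000, false));           // too soon to resend
    check(shouldSend(dn, t0 + 1_000, true));             // rescanner overrides
    check(shouldSend(dn, t0 + RETRY_INTERVAL_MS, false)); // interval elapsed
    System.out.println("ok");
  }

  static void check(boolean cond) {
    if (!cond) throw new AssertionError();
  }
}
```

A monotonic clock matters here: a wall clock can jump backwards (NTP adjustments) and stall the resend logic indefinitely.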
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816843#comment-13816843 ] Colin Patrick McCabe commented on HDFS-5366: Good find, Chris. We definitely should update the {{lastCachingDirectiveSentTimeMs}} just once in that function. As you said, I'm waiting for HDFS-5394 to land before rebasing this. I kicked the jenkins build, but it's still pending.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816802#comment-13816802 ] Chris Nauroth commented on HDFS-5366: - I tested this patch and found that blocks were never uncaching. The NameNode never sent DNA_UNCACHE messages to the DataNode. The reason is that there are separate calls to {{DatanodeManager#getCacheCommand}} to get the DNA_CACHE set followed by the DNA_UNCACHE set. The method internally resets the last message time for the DataNode. This means that when it's time to send messages, the first call for the DNA_CACHE messages succeeds and resets the clock for that DataNode to right now. Then, the second call for the DNA_UNCACHE messages always returns null, because it looks like it's not time to send messages. To solve this, we need to set the DataNode's last caching directive sent time just once, after calculating both the DNA_CACHE and DNA_UNCACHE commands. I changed the code as follows to do this. Feel free to incorporate it into the next patch. (I'm not uploading a new patch right now, because I don't want to detangle it out of the HDFS-5394 patch applied in my environment.) 
In {{DatanodeManager#handleHeartbeat}}:
{code}
long monoTimeMs = Time.monotonicNow();
if (sendCachingCommands) {
  if ((monoTimeMs - nodeinfo.getLastCachingDirectiveSentTimeMs()) >=
      timeBetweenResendingCachingDirectivesMs) {
    DatanodeCommand pendingCacheCommand = getCacheCommand(
        nodeinfo.getPendingCached(), nodeinfo, DatanodeProtocol.DNA_CACHE,
        blockPoolId);
    if (pendingCacheCommand != null) {
      cmds.add(pendingCacheCommand);
    }
    DatanodeCommand pendingUncacheCommand = getCacheCommand(
        nodeinfo.getPendingUncached(), nodeinfo, DatanodeProtocol.DNA_UNCACHE,
        blockPoolId);
    if (pendingUncacheCommand != null) {
      cmds.add(pendingUncacheCommand);
    }
    // Reset the clock once, after computing both command sets.
    nodeinfo.setLastCachingDirectiveSentTimeMs(monoTimeMs);
  }
}
{code}
And {{DatanodeManager#getCacheCommand}}:
{code}
/**
 * Convert a CachedBlocksList into a DatanodeCommand with a list of blocks.
 *
 * @param list     The {@link CachedBlocksList}. This function clears the list.
 * @param datanode The datanode.
 * @param action   The action to perform in the command.
 * @param poolId   The block pool id.
 * @return         A DatanodeCommand to be sent back to the DN, or null if
 *                 there is nothing to be done.
 */
private DatanodeCommand getCacheCommand(CachedBlocksList list,
    DatanodeDescriptor datanode, int action, String poolId) {
  int length = list.size();
  if (length == 0) {
    return null;
  }
  // Read and clear the existing cache commands.
  long[] blockIds = new long[length];
  int i = 0;
  for (Iterator<CachedBlock> iter = list.iterator(); iter.hasNext(); ) {
    CachedBlock cachedBlock = iter.next();
    blockIds[i++] = cachedBlock.getBlockId();
    iter.remove();
  }
  return new BlockIdCommand(action, poolId, blockIds);
}
{code}
I re-tested with these changes, and it worked.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797468#comment-13797468 ] Colin Patrick McCabe commented on HDFS-5366: The thing about block replication is, if you lose all copies of the block, you have a problem. For us, if we lose all cache replicas, it's not a big deal. It's not obvious that a block which has 2 cached replicas out of 3 requested should be given lower priority than one with 0 out of 3. Maybe the 2/3 block is just that much more important. It will depend on which pools the requests came from. I guess we'll have to do that as part of the effort to do pool quotas.
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797325#comment-13797325 ] Colin Patrick McCabe commented on HDFS-5366: As Andrew pointed out on HDFS-5096, we should also kick the CRMon on a DN failure. We should also avoid scheduling new work on decommissioning nodes (as well as stale nodes).
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797319#comment-13797319 ] Andrew Wang commented on HDFS-5366: --- One interesting idea from the block replication code is having priorities for replication work based on the current and expected replication factor. Maybe a "0 of 3" case should be rescheduled elsewhere more quickly than the 10.5 minute dead datanode interval, while we let a mild case of "2 of 3" sit. I don't think this will require tracking our own list of "stale" or "dead" nodes, just a list of nodes we've already tried for an outstanding request. We reset if we've tried all targets. I seem to remember the block recovery code or something doing this. Avoiding stale nodes might also be good enough, if we think that heartbeats are a good proxy for the DN's ability to cache/uncache. This probably isn't true for uncaching though, since as you've noted, a hung client could just hold onto a ZCR lease.
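Andrew's priority idea can be sketched as ordering pending cache work by how under-cached each block is, so a "0 of 3" block is handled before a "2 of 3" one. The class and field names below are hypothetical, not from the actual HDFS code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch: rank pending cache work by the ratio of current
// cached replicas to requested cached replicas; a lower ratio is more
// urgent, mirroring the block replication code's priority queues.
public class CachePrioritySketch {
  static class PendingBlock {
    final String id;
    final int cached;    // current cached replicas
    final int wanted;    // requested cached replicas
    PendingBlock(String id, int cached, int wanted) {
      this.id = id; this.cached = cached; this.wanted = wanted;
    }
  }

  // Lower cached/wanted ratio sorts first (more urgent).
  static final Comparator<PendingBlock> URGENCY =
      Comparator.comparingDouble(b -> (double) b.cached / b.wanted);

  public static void main(String[] args) {
    PendingBlock[] work = {
        new PendingBlock("blk_2", 2, 3),
        new PendingBlock("blk_0", 0, 3),
        new PendingBlock("blk_1", 1, 3),
    };
    Arrays.sort(work, URGENCY);
    // Fully uncached blocks come first.
    System.out.println(work[0].id + " " + work[1].id + " " + work[2].id);
  }
}
```

As Colin notes in a later comment, this ordering alone may be too simple: the relative importance of the pools behind each request could outweigh the raw ratio.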
[jira] [Commented] (HDFS-5366) recaching improvements
[ https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796322#comment-13796322 ] Colin Patrick McCabe commented on HDFS-5366: The other question that came up in discussion on HDFS-5096 is whether we should have a dedicated thread (independent of the {{CacheReplicationMonitor}} thread) which periodically re-examines the outstanding cache and uncache requests, and reschedules them to a different node if they aren't fulfilled. I've thought about this, but I'm not sure that we need it. The problem is that both caching and uncaching take time. Caching takes time because it involves reading from disk. Uncaching takes time because a client might have an mmap that needs to be revoked. The involuntary revocation period will be at least 5 minutes, to avoid having clients burned by GCs. If we're too aggressive about rescheduling our cache/uncache operations, we may create a lot of churn. If the period of such a "rescheduler thread" would be measured in minutes, isn't it simpler to just use the rescanning thread to handle this scenario? The other problem is that we currently rely on the {{DatanodeManager}} to tell us when a node is bad. Its timeouts are generous (10.5 minutes by default to declare a node dead), so the proposed "rescheduler" would either have to maintain its own list of who is naughty and nice, or have a really long period (again overlapping with the rescanner thread). I don't really want to duplicate the deadNodes list... I do think we should resend the DNA_CACHE, etc. as I mentioned above. Networks do lose messages, after all. But we might have to assume that if a DN tells us it can cache X bytes, that it's telling the truth. Otherwise, the failure cases we have to think about tend to proliferate.