[jira] [Commented] (YARN-2917) Potential deadlock in AsyncDispatcher when System.exit called in AsyncDispatcher#dispatch and AsyncDispatcher#serviceStop from shutdown hook
[ https://issues.apache.org/jira/browse/YARN-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237677#comment-14237677 ] Hadoop QA commented on YARN-2917: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685697/0001-YARN-2917.patch against trunk revision 120e1de. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6029//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6029//console This message is automatically generated. Potential deadlock in AsyncDispatcher when System.exit called in AsyncDispatcher#dispatch and AsyncDispatcher#serviceStop from shutdown hook Key: YARN-2917 URL: https://issues.apache.org/jira/browse/YARN-2917 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2917.patch I encountered a scenario where the RM hung while shutting down and kept logging {{2014-12-03 19:32:44,283 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain.}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
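The hang reported in YARN-2917 can be reproduced in miniature with a plain JVM shutdown hook. The sketch below is illustrative only (hypothetical class and field names, not the actual AsyncDispatcher code): the dispatch thread hits a fatal error and calls System.exit() before a drained flag is ever set, while the shutdown hook, standing in for serviceStop(), waits for that flag forever, so the JVM never exits.

{code}
// Minimal sketch of the reported hang pattern (hypothetical names).
public class DispatcherShutdownHangDemo {
  private static final Object drainLock = new Object();
  private static volatile boolean drained = false;

  public static void main(String[] args) throws Exception {
    // serviceStop() analogue: registered as a shutdown hook, waits for the drain flag.
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        synchronized (drainLock) {
          while (!drained) {                        // never becomes true
            System.out.println("Waiting for dispatcher to drain.");
            try {
              drainLock.wait(1000);
            } catch (InterruptedException ignored) {
              // keep waiting, mirroring the repeated log message
            }
          }
        }
      }
    });

    // dispatch() analogue: a fatal error triggers System.exit() before the queue drains.
    Thread dispatcher = new Thread() {
      @Override
      public void run() {
        System.exit(1);                             // runs the hook above and never returns
      }
    };
    dispatcher.start();
    dispatcher.join();                              // main thread blocks here forever
  }
}
{code}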
[jira] [Commented] (YARN-2927) InMemorySCMStore properties are inconsistent
[ https://issues.apache.org/jira/browse/YARN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237695#comment-14237695 ] Hudson commented on YARN-2927: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #33 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/33/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt InMemorySCMStore properties are inconsistent Key: YARN-2927 URL: https://issues.apache.org/jira/browse/YARN-2927 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Labels: newbie, supportability Fix For: 2.7.0 Attachments: YARN-2927.001.patch, YARN-2927.002.patch I see these properties in the yarn-default.xml file: yarn.sharedcache.store.in-memory.check-period-mins yarn.sharedcache.store.in-memory.initial-delay-mins yarn.sharedcache.store.in-memory.staleness-period-mins YarnConfiguration looks like it's missing some properties: public static final String SHARED_CACHE_PREFIX = "yarn.sharedcache."; public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + "store."; public static final String IN_MEMORY_STORE_PREFIX = SHARED_CACHE_PREFIX + "in-memory."; public static final String IN_MEMORY_STALENESS_PERIOD_MINS = IN_MEMORY_STORE_PREFIX + "staleness-period-mins"; It looks like the definition for IN_MEMORY_STORE_PREFIX should be: public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + "in-memory."; Just to be clear, there are properties that exist in yarn-default.xml that are effectively misspelled in the *Java* file, not the .xml file. This is similar to YARN-2461 and MAPREDUCE-6087. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
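For reference, the corrected definitions proposed in the description would look roughly like the following (a sketch of the fix as described, not necessarily the exact committed patch). The key point is that IN_MEMORY_STORE_PREFIX must build on SCM_STORE_PREFIX so the resolved keys match yarn-default.xml.

{code}
// Sketch of the constants as the description proposes them.
public static final String SHARED_CACHE_PREFIX = "yarn.sharedcache.";
public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + "store.";
public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + "in-memory.";
// Now resolves to "yarn.sharedcache.store.in-memory.staleness-period-mins",
// which is the key that actually appears in yarn-default.xml.
public static final String IN_MEMORY_STALENESS_PERIOD_MINS =
    IN_MEMORY_STORE_PREFIX + "staleness-period-mins";
{code}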
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237694#comment-14237694 ] Hudson commented on YARN-1492: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #33 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/33/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Priority: Critical Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf, shared_cache_design_v6.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2927) InMemorySCMStore properties are inconsistent
[ https://issues.apache.org/jira/browse/YARN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237703#comment-14237703 ] Hudson commented on YARN-2927: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #1984 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1984/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt InMemorySCMStore properties are inconsistent Key: YARN-2927 URL: https://issues.apache.org/jira/browse/YARN-2927 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Labels: newbie, supportability Fix For: 2.7.0 Attachments: YARN-2927.001.patch, YARN-2927.002.patch I see these properties in the yarn-default.xml file: yarn.sharedcache.store.in-memory.check-period-mins yarn.sharedcache.store.in-memory.initial-delay-mins yarn.sharedcache.store.in-memory.staleness-period-mins YarnConfiguration looks like it's missing some properties: public static final String SHARED_CACHE_PREFIX = yarn.sharedcache.; public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + store.; public static final String IN_MEMORY_STORE_PREFIX = SHARED_CACHE_PREFIX + in-memory.; public static final String IN_MEMORY_STALENESS_PERIOD_MINS = IN_MEMORY_STORE_PREFIX + staleness-period-mins; It looks like the definition for IN_MEMORY_STORE_PREFIX should be: public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + in-memory.; Just to be clear, there are properties that exist in yarn-default.xml that are effectively misspelled in the *Java* file, not the .xml file. This is similar to YARN-2461 and MAPREDUCE-6087. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237702#comment-14237702 ] Hudson commented on YARN-1492: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #1984 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1984/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Priority: Critical Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf, shared_cache_design_v6.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2136) RMStateStore can explicitly handle store/update events when fenced
[ https://issues.apache.org/jira/browse/YARN-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237755#comment-14237755 ] Varun Saxena commented on YARN-2136: Thanks [~jianhe] for reviewing and committing this. RMStateStore can explicitly handle store/update events when fenced -- Key: YARN-2136 URL: https://issues.apache.org/jira/browse/YARN-2136 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Jian He Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2136.002.patch, YARN-2136.003.patch, YARN-2136.004.patch, YARN-2136.005.patch, YARN-2136.patch RMStateStore can choose to handle/ignore store/update events upfront instead of invoking more ZK operations if the state store is in the fenced state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237766#comment-14237766 ] Hudson commented on YARN-1492: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #32 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/32/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Priority: Critical Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf, shared_cache_design_v6.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2927) InMemorySCMStore properties are inconsistent
[ https://issues.apache.org/jira/browse/YARN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237767#comment-14237767 ] Hudson commented on YARN-2927: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #32 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/32/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java InMemorySCMStore properties are inconsistent Key: YARN-2927 URL: https://issues.apache.org/jira/browse/YARN-2927 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Labels: newbie, supportability Fix For: 2.7.0 Attachments: YARN-2927.001.patch, YARN-2927.002.patch I see these properties in the yarn-default.xml file: yarn.sharedcache.store.in-memory.check-period-mins yarn.sharedcache.store.in-memory.initial-delay-mins yarn.sharedcache.store.in-memory.staleness-period-mins YarnConfiguration looks like it's missing some properties: public static final String SHARED_CACHE_PREFIX = yarn.sharedcache.; public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + store.; public static final String IN_MEMORY_STORE_PREFIX = SHARED_CACHE_PREFIX + in-memory.; public static final String IN_MEMORY_STALENESS_PERIOD_MINS = IN_MEMORY_STORE_PREFIX + staleness-period-mins; It looks like the definition for IN_MEMORY_STORE_PREFIX should be: public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + in-memory.; Just to be clear, there are properties that exist in yarn-default.xml that are effectively misspelled in the *Java* file, not the .xml file. This is similar to YARN-2461 and MAPREDUCE-6087. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1423#comment-1423 ] Rohith commented on YARN-2762: -- I believe the test failures are either unrelated or temporary. RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.2.patch, YARN-2762.3.patch, YARN-2762.4.patch, YARN-2762.5.patch, YARN-2762.patch All NodeLabel args validations are done at the server side. The same can be done in RMAdminCLI so that unnecessary RPC calls can be avoided. And for input such as x,y,,z,, there is no need to add an empty string; it can instead be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237782#comment-14237782 ] Hadoop QA commented on YARN-2637: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685475/YARN-2637.16.patch against trunk revision 120e1de. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueParsing org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerDynamicBehavior org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6030//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6030//console This message is automatically generated. 
maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.2.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when a new application is submitted to the RM, it will check if the app can be activated in the following way: {code} for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() >= getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info("Application " + application.getApplicationId() + " from user: " + application.getUser() + " activated in queue: " + getQueueName()); } } {code} An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200, and if a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all of the queue's resources instead of only max_am_resource_percent of the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
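To make the arithmetic of that example concrete, here is a small sketch (hypothetical variable names; the values are taken from the description, not from scheduler code):

{code}
// Worked numbers from the example above.
int queueCapacityMb = 1024;                 // queue capacity ~ 1G
double maxAmResourcePercent = 0.2;          // maximum-am-resource-percent
int minimumAllocationMb = 1;                // minimum_allocation

int maxAmResourceMb = (int) (queueCapacityMb * maxAmResourcePercent);  // ~200 MB
int maxActiveAms = maxAmResourceMb / minimumAllocationMb;              // ~200 AMs

// If each AM actually requests 5 MB (> minimum_allocation), the activated AMs
// consume ~1000 MB -- essentially the whole queue, not 20% of it.
int actualAmUsageMb = maxActiveAms * 5;
{code}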
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237802#comment-14237802 ] Hudson commented on YARN-1492: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #769 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/769/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Priority: Critical Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf, shared_cache_design_v6.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2927) InMemorySCMStore properties are inconsistent
[ https://issues.apache.org/jira/browse/YARN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237803#comment-14237803 ] Hudson commented on YARN-2927: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #769 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/769/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt InMemorySCMStore properties are inconsistent Key: YARN-2927 URL: https://issues.apache.org/jira/browse/YARN-2927 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Labels: newbie, supportability Fix For: 2.7.0 Attachments: YARN-2927.001.patch, YARN-2927.002.patch I see these properties in the yarn-default.xml file: yarn.sharedcache.store.in-memory.check-period-mins yarn.sharedcache.store.in-memory.initial-delay-mins yarn.sharedcache.store.in-memory.staleness-period-mins YarnConfiguration looks like it's missing some properties: public static final String SHARED_CACHE_PREFIX = yarn.sharedcache.; public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + store.; public static final String IN_MEMORY_STORE_PREFIX = SHARED_CACHE_PREFIX + in-memory.; public static final String IN_MEMORY_STALENESS_PERIOD_MINS = IN_MEMORY_STORE_PREFIX + staleness-period-mins; It looks like the definition for IN_MEMORY_STORE_PREFIX should be: public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + in-memory.; Just to be clear, there are properties that exist in yarn-default.xml that are effectively misspelled in the *Java* file, not the .xml file. This is similar to YARN-2461 and MAPREDUCE-6087. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2291) Timeline and RM web services should use same authentication code
[ https://issues.apache.org/jira/browse/YARN-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237805#comment-14237805 ] Varun Vasudev commented on YARN-2291: - YARN-2656 changes the RM to use the filter in hadoop-common. Closing this. Timeline and RM web services should use same authentication code Key: YARN-2291 URL: https://issues.apache.org/jira/browse/YARN-2291 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 The TimelineServer and the RM web services have very similar requirements and implementation for authentication via delegation tokens apart from the fact that the RM web services requires delegation tokens to be passed as a header. They should use the same code base instead of different implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2291) Timeline and RM web services should use same authentication code
[ https://issues.apache.org/jira/browse/YARN-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev resolved YARN-2291. - Resolution: Fixed Fix Version/s: 2.6.0 Timeline and RM web services should use same authentication code Key: YARN-2291 URL: https://issues.apache.org/jira/browse/YARN-2291 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 The TimelineServer and the RM web services have very similar requirements and implementation for authentication via delegation tokens apart from the fact that the RM web services requires delegation tokens to be passed as a header. They should use the same code base instead of different implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2292) RM web services should use hadoop-common for authentication using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev resolved YARN-2292. - Resolution: Fixed Fix Version/s: 2.6.0 YARN-2656 changes the RM to use the filter in hadoop-common. Closing this. RM web services should use hadoop-common for authentication using delegation tokens --- Key: YARN-2292 URL: https://issues.apache.org/jira/browse/YARN-2292 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 HADOOP-10771 refactors the WebHDFS authentication code to hadoop-common. YARN-2290 will add support for passing delegation tokens via headers. Once support is added RM web services should use the authentication code from hadoop-common -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2426) ResourceManager is not able to renew WebHDFS token when application is submitted by Yarn WebService
[ https://issues.apache.org/jira/browse/YARN-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev resolved YARN-2426. - Resolution: Fixed Fix Version/s: 2.6.0 Fixed with HDFS-6904 exposing an API to allow clients to set the service. ResourceManger is not able renew WebHDFS token when application submitted by Yarn WebService Key: YARN-2426 URL: https://issues.apache.org/jira/browse/YARN-2426 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 2.6.0 Environment: Hadoop Keberos (Secure) cluster with LinuxContainerExcutor is enabled With SPNEGO on for Yarn new RM web services for application submission So during application submission xml/json structure was pass webhdfs token Reporter: Karam Singh Assignee: Varun Vasudev Fix For: 2.6.0 Encountered this issue during using new YARN's RM WS for application submission, on single node cluster while submitting Distributed Shell application using RM WS(webservice). For this we need pass custom script and AppMaster jar along with webhdfs token. Application was failing with ResouceManager was failing to renew token for user (appOwner). So RM was Rejecting application with following exception trace in RM log: {code} 2014-08-19 03:12:54,733 WARN security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleDTRenewerAppSubmitEvent(661)) - Unable to add the application to the delegation token renewer. java.io.IOException: Failed to renew token: Kind: WEBHDFS delegation, Service: NNHOST:FSPORT, Ident: (WEBHDFS delegation token for hrt_qa) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:394) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$5(DelegationTokenRenewer.java:357) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:657) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:638) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Unexpected HTTP response: code=-1 != 200, op=RENEWDELEGATIONTOKEN, message=null at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:331) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:90) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:598) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:448) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:477) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:473) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.renewDelegationToken(WebHdfsFileSystem.java:1318) at org.apache.hadoop.hdfs.web.TokenAspect$TokenManager.renew(TokenAspect.java:73) at org.apache.hadoop.security.token.Token.renew(Token.java:377) at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:477) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:1) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:473) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:392) ... 6 more Caused by: java.io.IOException: The error stream is null. at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.jsonParse(WebHdfsFileSystem.java:304) at
[jira] [Commented] (YARN-2927) InMemorySCMStore properties are inconsistent
[ https://issues.apache.org/jira/browse/YARN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237830#comment-14237830 ] Hudson commented on YARN-2927: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1964 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1964/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java InMemorySCMStore properties are inconsistent Key: YARN-2927 URL: https://issues.apache.org/jira/browse/YARN-2927 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Labels: newbie, supportability Fix For: 2.7.0 Attachments: YARN-2927.001.patch, YARN-2927.002.patch I see these properties in the yarn-default.xml file: yarn.sharedcache.store.in-memory.check-period-mins yarn.sharedcache.store.in-memory.initial-delay-mins yarn.sharedcache.store.in-memory.staleness-period-mins YarnConfiguration looks like it's missing some properties: public static final String SHARED_CACHE_PREFIX = yarn.sharedcache.; public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + store.; public static final String IN_MEMORY_STORE_PREFIX = SHARED_CACHE_PREFIX + in-memory.; public static final String IN_MEMORY_STALENESS_PERIOD_MINS = IN_MEMORY_STORE_PREFIX + staleness-period-mins; It looks like the definition for IN_MEMORY_STORE_PREFIX should be: public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + in-memory.; Just to be clear, there are properties that exist in yarn-default.xml that are effectively misspelled in the *Java* file, not the .xml file. This is similar to YARN-2461 and MAPREDUCE-6087. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237829#comment-14237829 ] Hudson commented on YARN-1492: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1964 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1964/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Priority: Critical Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf, shared_cache_design_v6.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2927) InMemorySCMStore properties are inconsistent
[ https://issues.apache.org/jira/browse/YARN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237890#comment-14237890 ] Hudson commented on YARN-2927: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #32 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/32/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt InMemorySCMStore properties are inconsistent Key: YARN-2927 URL: https://issues.apache.org/jira/browse/YARN-2927 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Labels: newbie, supportability Fix For: 2.7.0 Attachments: YARN-2927.001.patch, YARN-2927.002.patch I see these properties in the yarn-default.xml file: yarn.sharedcache.store.in-memory.check-period-mins yarn.sharedcache.store.in-memory.initial-delay-mins yarn.sharedcache.store.in-memory.staleness-period-mins YarnConfiguration looks like it's missing some properties: public static final String SHARED_CACHE_PREFIX = yarn.sharedcache.; public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + store.; public static final String IN_MEMORY_STORE_PREFIX = SHARED_CACHE_PREFIX + in-memory.; public static final String IN_MEMORY_STALENESS_PERIOD_MINS = IN_MEMORY_STORE_PREFIX + staleness-period-mins; It looks like the definition for IN_MEMORY_STORE_PREFIX should be: public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + in-memory.; Just to be clear, there are properties that exist in yarn-default.xml that are effectively misspelled in the *Java* file, not the .xml file. This is similar to YARN-2461 and MAPREDUCE-6087. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237889#comment-14237889 ] Hudson commented on YARN-1492: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #32 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/32/]) YARN-2927. [YARN-1492] InMemorySCMStore properties are inconsistent. (Ray Chiang via kasha) (kasha: rev 120e1decd7f6861e753269690d454cb14c240857) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Priority: Critical Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf, shared_cache_design_v6.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237910#comment-14237910 ] Hadoop QA commented on YARN-2902: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685607/YARN-2902.002.patch against trunk revision 8963515. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6031//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6031//console This message is automatically generated. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-2910: Attachment: YARN-2910.4.patch I did not change the assignment :-( Yes, the {{when(schedulable.getResourceUsage()).thenReturn(smallResource);}} should not have been in the patch, my mistake. Not sure how that ended up in the patch; I used it during development but not in the last tests. On my machine the test failed with just adding applications. The issue seems to be in the initialisation of the application attempt. When I added debug output into the test run I could see the initialisation of the app attempt in the mock taking up a lot of time, which meant that the {{getResourceUsage}} almost always ran over an empty list unless the number of iterations was raised above 1000. As soon as I moved the creation out of the thread, the failure occurred within 5 iterations of the {{getResourceUsage}} call in the second thread after adding fewer than 15 or so app instances. I have attached an updated patch which passes with the new code and has a 100% failure rate with the old code. This version of the test runs faster and is more reliable than the previous ones. FSLeafQueue can throw ConcurrentModificationException - Key: YARN-2910 URL: https://issues.apache.org/jira/browse/YARN-2910 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Wilfred Spiegelenburg Assignee: Ray Chiang Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch The list that maintains the runnable and the non runnable apps are a standard ArrayList but there is no guarantee that it will only be manipulated by one thread in the system. This can lead to the following exception: {noformat} 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) at java.util.ArrayList$Itr.next(ArrayList.java:831) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) {noformat} Full stack trace in the attached file. We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
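The race in the YARN-2910 description is easy to demonstrate outside the scheduler. Below is a rough, self-contained sketch (hypothetical names, not FSLeafQueue code): one thread iterates a plain ArrayList the way getResourceUsage() iterates runnableApps while another thread keeps adding to it, and the fail-fast iterator may throw ConcurrentModificationException.

{code}
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the race: iterate an ArrayList in one thread while another mutates it.
public class CmeDemo {
  public static void main(String[] args) throws Exception {
    final List<Integer> apps = new ArrayList<Integer>();
    for (int i = 0; i < 1000; i++) {
      apps.add(i);
    }

    Thread writer = new Thread() {                  // "add application" analogue
      @Override
      public void run() {
        for (int i = 0; i < 100000; i++) {
          apps.add(i);
        }
      }
    };
    writer.start();

    long total = 0;
    for (Integer usage : apps) {                    // "getResourceUsage" analogue
      total += usage;                               // may throw ConcurrentModificationException
    }
    writer.join();
    System.out.println(total);
  }
}
{code}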
[jira] [Updated] (YARN-2922) Concurrent Modification Exception in LeafQueue when collecting applications
[ https://issues.apache.org/jira/browse/YARN-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2922: - Attachment: 0001-YARN-2922.patch Concurrent Modification Exception in LeafQueue when collecting applications --- Key: YARN-2922 URL: https://issues.apache.org/jira/browse/YARN-2922 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Rohith Attachments: 0001-YARN-2922.patch java.util.ConcurrentModificationException at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115) at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.collectSchedulerApplications(LeafQueue.java:1618) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getAppsInQueue(CapacityScheduler.java:1119) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:798) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:234) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2892) Unable to get AMRMToken in unmanaged AM when using a secure cluster
[ https://issues.apache.org/jira/browse/YARN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238000#comment-14238000 ] Junping Du commented on YARN-2892: -- Wait... I looked at the patch again; it looks like it will introduce a serious incompatibility: the ApplicationReport returned to the client now includes the short name rather than the full name as before. We should be super careful now, as we have supported YARN rolling upgrade since 2.6. Any other ways? Unable to get AMRMToken in unmanaged AM when using a secure cluster --- Key: YARN-2892 URL: https://issues.apache.org/jira/browse/YARN-2892 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Sevada Abraamyan Assignee: Sevada Abraamyan Attachments: YARN-2892.patch, YARN-2892.patch, YARN-2892.patch An AMRMToken is retrieved from the ApplicationReport by the YarnClient. When the RM creates the ApplicationReport and sends it back to the client it makes a simple security check whether it should include the AMRMToken in the report (See createAndGetApplicationReport in RMAppImpl). This security check verifies that the user who submitted the original application is the same user who is requesting the ApplicationReport. If they are indeed the same user then it includes the AMRMToken, otherwise it does not include it. The problem arises from the fact that when an application is submitted, the RM saves the short username of the user who created the application (See submitApplication in ClientRMService). Afterwards when the ApplicationReport is requested, the system tries to match the full username of the requester against the previously stored short username. In a secure cluster using Kerberos this check fails because the realm is stripped from the principal when we request a short username. So for example the short username might be Foo whereas the full username is f...@company.com. Note: A very similar problem has been previously reported ([Yarn-2232|https://issues.apache.org/jira/browse/YARN-2232]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
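A tiny illustration of the failing check described above (hypothetical values only; the real comparison happens when the RM builds the ApplicationReport):

{code}
// Illustration only: the RM stores the submitter's short name at submission time,
// then compares it with the caller's full Kerberos principal when the report is
// requested, so the check fails on a secure cluster even for the same user.
String storedAtSubmission = "foo";              // short name saved at submission
String callerFullName     = "foo@COMPANY.COM";  // full principal at report time
boolean includeAmrmToken  = storedAtSubmission.equals(callerFullName);  // false
{code}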
[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238007#comment-14238007 ] Wilfred Spiegelenburg commented on YARN-2910: - The fix causes the {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler}} to fail. There is a deadlock that is created by the synchronised read access in the leaf queue for the {{runnableApps}}. If an app has two containers at different stages in the allocation it can happen that the {{appAttempt}} is locked by one and the {{runnableApps}} by the second causing the hang. This is what I was afraid of when I mentioned the slow down, I did not anticipate it this bad but the number of reads far outnumber the writes. The earlier proposed CopyOnWriteArrayList will also not work due to the sort that is called (and I overlooked) which is not supported. FSLeafQueue can throw ConcurrentModificationException - Key: YARN-2910 URL: https://issues.apache.org/jira/browse/YARN-2910 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Wilfred Spiegelenburg Assignee: Ray Chiang Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch The list that maintains the runnable and the non runnable apps are a standard ArrayList but there is no guarantee that it will only be manipulated by one thread in the system. This can lead to the following exception: {noformat} 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) at java.util.ArrayList$Itr.next(ArrayList.java:831) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) {noformat} Full stack trace in the attached file. We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
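The hang described in this comment is a lock-ordering problem rather than a ConcurrentModificationException. A minimal sketch of that shape (hypothetical lock names, not the FairScheduler code): one thread locks the app attempt and then needs the runnable-apps list, while another holds the list and needs the attempt, so neither can proceed.

{code}
// Minimal sketch of a lock-ordering deadlock between two hypothetical locks.
public class LockOrderDeadlockDemo {
  private static final Object appAttemptLock = new Object();
  private static final Object runnableAppsLock = new Object();

  public static void main(String[] args) {
    new Thread() {
      @Override
      public void run() {
        synchronized (appAttemptLock) {             // allocation path: attempt first
          pause();
          synchronized (runnableAppsLock) { }       // ...then the queue's app list
        }
      }
    }.start();

    new Thread() {
      @Override
      public void run() {
        synchronized (runnableAppsLock) {           // queue path: app list first
          pause();
          synchronized (appAttemptLock) { }         // ...then the attempt -> deadlock
        }
      }
    }.start();
  }

  private static void pause() {
    try {
      Thread.sleep(100);                            // widen the window so both locks are held
    } catch (InterruptedException ignored) {
    }
  }
}
{code}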
[jira] [Updated] (YARN-2464) Provide Hadoop as a local resource (on HDFS) which can be used by other projects
[ https://issues.apache.org/jira/browse/YARN-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2464: - Summary: Provide Hadoop as a local resource (on HDFS) which can be used by other projects (was: Provide Hadoop as a local resource (on HDFS) which can be used by other projcets) Provide Hadoop as a local resource (on HDFS) which can be used by other projects Key: YARN-2464 URL: https://issues.apache.org/jira/browse/YARN-2464 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du DEFAULT_YARN_APPLICATION_CLASSPATH is used by YARN projects to set up their AM / task classpaths if they have a dependency on Hadoop libraries. It'll be useful to provide similar access to a Hadoop tarball (Hadoop libs, native libraries) etc, which could be used instead - for applications which do not want to rely upon Hadoop versions from a cluster node. This would also require functionality to update the classpath/env for the apps based on the structure of the tar. As an example, MR has support for a full tar (for rolling upgrades). Similarly, Tez ships hadoop libraries along with its build. I'm not sure about the Spark / Storm / HBase model for this - but using a common copy instead of everyone localizing Hadoop libraries would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2922) Concurrent Modification Exception in LeafQueue when collecting applications
[ https://issues.apache.org/jira/browse/YARN-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238058#comment-14238058 ] Hadoop QA commented on YARN-2922: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685762/0001-YARN-2922.patch against trunk revision 8963515. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6034//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6034//console This message is automatically generated. Concurrent Modification Exception in LeafQueue when collecting applications --- Key: YARN-2922 URL: https://issues.apache.org/jira/browse/YARN-2922 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Rohith Attachments: 0001-YARN-2922.patch java.util.ConcurrentModificationException at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115) at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.collectSchedulerApplications(LeafQueue.java:1618) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getAppsInQueue(CapacityScheduler.java:1119) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:798) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:234) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238075#comment-14238075 ] Mit Desai commented on YARN-2900: - Thanks. I will check that out. Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238096#comment-14238096 ] Hadoop QA commented on YARN-2910: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685755/YARN-2910.4.patch against trunk revision 8963515. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.TestRM org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6033//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6033//console This message is automatically generated. FSLeafQueue can throw ConcurrentModificationException - Key: YARN-2910 URL: https://issues.apache.org/jira/browse/YARN-2910 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Wilfred Spiegelenburg Assignee: Ray Chiang Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch The list that maintains the runnable and the non runnable apps are a standard ArrayList but there is no guarantee that it will only be manipulated by one thread in the system. This can lead to the following exception: {noformat} 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. 
java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) at java.util.ArrayList$Itr.next(ArrayList.java:831) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) {noformat} Full stack trace in the attached file. We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
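The description proposes CopyOnWriteArrayList; the sketch below shows the shape of that fix with plain strings standing in for FSAppAttempt, so it is an assumption-laden illustration rather than the actual FSLeafQueue change.
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class RunnableAppList {
  // Iteration sees an immutable snapshot, so a concurrent add/remove from
  // another thread can no longer break the usage computation.
  private final List<String> runnableApps = new CopyOnWriteArrayList<String>();

  public void addApp(String appAttemptId) {
    runnableApps.add(appAttemptId);
  }

  public void removeApp(String appAttemptId) {
    runnableApps.remove(appAttemptId);
  }

  public int resourceUsage() {
    int total = 0;
    for (String app : runnableApps) { // safe even while addApp/removeApp run
      total += app.length();          // stand-in for summing each app's usage
    }
    return total;
  }
}
{code}
The trade-off is that every write copies the backing array; that is usually acceptable when iteration (every allocate call) vastly outnumbers app additions and removals.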
[jira] [Updated] (YARN-2683) registry config options: document and move to core-default
[ https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2683: - Attachment: HADOOP-10530-005.patch patch -005; rebased against trunk commit 144da2 registry config options: document and move to core-default -- Key: YARN-2683 URL: https://issues.apache.org/jira/browse/YARN-2683 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-10530-005.patch, YARN-2683-001.patch, YARN-2683-002.patch, YARN-2683-003.patch Original Estimate: 1h Remaining Estimate: 1h Add to {{yarn-site}} a page on registry configuration parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238130#comment-14238130 ] Wangda Tan commented on YARN-2762: -- [~rohithsharma], Thanks for the update. Two last minor comments: - Rename {{NO_LABEL}} to {{NO_LABEL_ERR_MSG}}. - Make {{No node-to-labels mappings are specified}} a final field as well, like {{NO_MAPPING_ERR_MSG}}. Wangda RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.2.patch, YARN-2762.3.patch, YARN-2762.4.patch, YARN-2762.5.patch, YARN-2762.patch All NodeLabel args validation's are done at server side. The same can be done at RMAdminCLI so that unnecessary RPC calls can be avoided. And for the input such as x,y,,z,, no need to add empty string instead can be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
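A minimal client-side sketch of the trimming and skipping behaviour discussed here, assuming plain comma-separated input; it is an illustration, not the attached patch.
{code}
import java.util.ArrayList;
import java.util.List;

public class NodeLabelArgs {
  // "x,y,,z," becomes [x, y, z]: entries are trimmed and empty strings are
  // skipped before any RPC is sent to the RM.
  public static List<String> parseLabels(String arg) {
    List<String> labels = new ArrayList<String>();
    if (arg == null) {
      return labels;
    }
    for (String label : arg.split(",")) {
      String trimmed = label.trim();
      if (!trimmed.isEmpty()) {
        labels.add(trimmed);
      }
    }
    return labels;
  }
}
{code}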
[jira] [Commented] (YARN-2892) Unable to get AMRMToken in unmanaged AM when using a secure cluster
[ https://issues.apache.org/jira/browse/YARN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238137#comment-14238137 ] Rohith commented on YARN-2892: -- bq. the applicationReport return to client include short name now rather than full name before. I did not get where exactly this breaks compatibility. Even before the patch, the application report sends the short name instead of the full name. Am I missing anything? Unable to get AMRMToken in unmanaged AM when using a secure cluster --- Key: YARN-2892 URL: https://issues.apache.org/jira/browse/YARN-2892 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Sevada Abraamyan Assignee: Sevada Abraamyan Attachments: YARN-2892.patch, YARN-2892.patch, YARN-2892.patch An AMRMToken is retrieved from the ApplicationReport by the YarnClient. When the RM creates the ApplicationReport and sends it back to the client it makes a simple security check whether it should include the AMRMToken in the report (See createAndGetApplicationReport in RMAppImpl). This security check verifies that the user who submitted the original application is the same user who is requesting the ApplicationReport. If they are indeed the same user then it includes the AMRMToken, otherwise it does not include it. The problem arises from the fact that when an application is submitted, the RM saves the short username of the user who created the application (See submitApplication in ClientRmService). Afterwards when the ApplicationReport is requested, the system tries to match the full username of the requester against the previously stored short username. In a secure cluster using Kerberos this check fails because the realm is stripped from the principal when we request a short username. So for example the short username might be Foo whereas the full username is f...@company.com Note: A very similar problem has been previously reported ([Yarn-2232|https://issues.apache.org/jira/browse/YARN-2232]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
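The short-vs-full name mismatch described above can be seen directly from UserGroupInformation; the snippet below just prints both forms for the current user. On a Kerberos-enabled client the full name carries the realm and the short name does not, which is why comparing one form against the other fails. This is an illustration only, not part of the patch.
{code}
import org.apache.hadoop.security.UserGroupInformation;

public class UserNamePrinter {
  public static void main(String[] args) throws Exception {
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    // With Kerberos the full name is a principal such as foo@EXAMPLE.COM,
    // while the short name has the realm stripped, so the check in
    // createAndGetApplicationReport compares two different strings.
    System.out.println("full name : " + ugi.getUserName());
    System.out.println("short name: " + ugi.getShortUserName());
  }
}
{code}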
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238140#comment-14238140 ] Hadoop QA commented on YARN-2637: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685475/YARN-2637.16.patch against trunk revision 144da2e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6035//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6035//console This message is automatically generated. maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.2.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext(); ) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() + " from user: "
        + application.getUser() + " activated in queue: " + getQueueName());
  }
}
{code}
As an example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200, and if a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only max_am_resource_percent of the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238139#comment-14238139 ] Junping Du commented on YARN-2637: -- I manually kicked off the Jenkins test again for the latest patch. maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.2.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext(); ) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() + " from user: "
        + application.getUser() + " activated in queue: " + getQueueName());
  }
}
{code}
As an example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200, and if a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only max_am_resource_percent of the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
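The numbers in the example above can be worked through directly (using 1024M for the "1G" queue); this is only a restatement of the description's arithmetic, not scheduler code.
{code}
public class AmLimitArithmetic {
  public static void main(String[] args) {
    int queueCapacityMb = 1024;          // the "1G" queue from the example
    double maxAmResourcePercent = 0.2;   // maximum-am-resource-percent
    int minimumAllocationMb = 1;         // minimum_allocation
    int actualAmSizeMb = 5;              // each AM really asks for 5M (> minimum)

    int maxAmResourceMb = (int) (queueCapacityMb * maxAmResourcePercent); // ~200M
    int maxAmCount = maxAmResourceMb / minimumAllocationMb;               // ~200 AMs
    int amResourceUsedMb = maxAmCount * actualAmSizeMb;                   // ~1000M

    System.out.println("AM resource cap     : " + maxAmResourceMb + "M");
    System.out.println("AMs allowed         : " + maxAmCount);
    System.out.println("AM resource consumed: " + amResourceUsedMb + "M of a "
        + queueCapacityMb + "M queue");
  }
}
{code}
Because the count is derived from minimum_allocation rather than from each AM's real request, the AMs end up holding roughly the whole queue, which is the violation this JIRA is about.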
[jira] [Updated] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2910: - Assignee: Wilfred Spiegelenburg (was: Ray Chiang) FSLeafQueue can throw ConcurrentModificationException - Key: YARN-2910 URL: https://issues.apache.org/jira/browse/YARN-2910 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch The list that maintains the runnable and the non runnable apps are a standard ArrayList but there is no guarantee that it will only be manipulated by one thread in the system. This can lead to the following exception: {noformat} 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) at java.util.ArrayList$Itr.next(ArrayList.java:831) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) {noformat} Full stack trace in the attached file. We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2495: Attachment: YARN-2495.20141208-1.patch Hi [~wangda], I have set the default implementation class to null as per your comment, but as you mentioned, after review and based on feedback we need to set the default class to either the configuration-based or the script-based Node Labels provider. Allow admin specify labels from each NM (Distributed configuration) --- Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, YARN-2495.20141119-1.patch, YARN-2495.20141126-1.patch, YARN-2495.20141204-1.patch, YARN-2495.20141208-1.patch, YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml (YARN-2923) or using script suggested by [~aw] (YARN-2729) ) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2925) Internal fields in LeafQueue access should be protected when accessed from FiCaSchedulerApp to calculate Headroom
[ https://issues.apache.org/jira/browse/YARN-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238155#comment-14238155 ] Craig Welch commented on YARN-2925: --- Hmm, there might be an even simpler approach - if we placed lock(s) (just a single lock, or potentially read/write) in the LeafQueue and then just held them around the final headroom calculation and the two locations where other changes occur (user consumed +- and queue usedResources +-), all of which I believe occur in the leaf queue, and then set up the lastClusterResource to be copied (inside the (write) lock), I think this would be resolved, and it would not be much of a change / much code. In fact, we would not need the QueueResourceInfo at all, and could potentially drop the HeadroomInfo as well. [~leftnoteasy] I think this might actually be the simplest approach. Thoughts? Internal fields in LeafQueue access should be protected when accessed from FiCaSchedulerApp to calculate Headroom - Key: YARN-2925 URL: https://issues.apache.org/jira/browse/YARN-2925 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2925.1.patch Upon YARN-2644, FiCaScheduler will calculate the up-to-date headroom before sending back the Allocation response to the AM. Headroom calculation happens on the LeafQueue side and uses fields like used resource, etc. But it is not protected by any lock of LeafQueue, so it might be corrupted if someone else is editing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238157#comment-14238157 ] Wangda Tan commented on YARN-2495: -- [~Naganarasimha], I meant you can leave it empty in this patch and set it after you have completed script/conf-based patch, just add a TODO comment should be fine. Make sense? Allow admin specify labels from each NM (Distributed configuration) --- Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, YARN-2495.20141119-1.patch, YARN-2495.20141126-1.patch, YARN-2495.20141204-1.patch, YARN-2495.20141208-1.patch, YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml (YARN-2923) or using script suggested by [~aw] (YARN-2729) ) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238158#comment-14238158 ] Zhijie Shen commented on YARN-2517: --- [~ozawa], we may have to hold off a bit on the async call implementation. Recently folks have had some offline discussions around the timeline server next gen. Some of the architecture may change in the future. Would you please keep an eye on YARN-2928? Implement TimelineClientAsync - Key: YARN-2517 URL: https://issues.apache.org/jira/browse/YARN-2517 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2517.1.patch, YARN-2517.2.patch In some scenarios, we'd like to put timeline entities in another thread so as not to block the current one. It's good to have a TimelineClientAsync like AMRMClientAsync and NMClientAsync. It can buffer entities, put them in a separate thread, and have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
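Purely as an illustration of the buffer-plus-worker-plus-callback shape the description asks for, here is a small sketch; {{Putter}} and {{PutCallback}} are hypothetical placeholders and do not correspond to the real TimelineClient API, which this JIRA's discussion suggests may change anyway.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncPutSketch {
  // Hypothetical stand-ins for the real put call and its response type.
  interface Putter { String put(String entity) throws Exception; }
  interface PutCallback { void onResponse(String response); void onError(Throwable t); }

  private final BlockingQueue<String> buffer = new LinkedBlockingQueue<String>();
  private final ExecutorService worker = Executors.newSingleThreadExecutor();

  // Buffers the entity and hands the actual put to a background thread,
  // reporting the outcome through the callback instead of blocking the caller.
  public void putAsync(String entity, final Putter putter, final PutCallback callback) {
    buffer.add(entity);
    worker.execute(new Runnable() {
      public void run() {
        try {
          callback.onResponse(putter.put(buffer.take()));
        } catch (Exception e) {
          callback.onError(e);
        }
      }
    });
  }

  public void stop() {
    worker.shutdown();
  }
}
{code}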
[jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
[ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238165#comment-14238165 ] Sangjin Lee commented on YARN-2928: --- Thanks [~vinodkv]! I'll post the design doc pretty soon (today or tomorrow). Application Timeline Server (ATS) next gen: phase 1 --- Key: YARN-2928 URL: https://issues.apache.org/jira/browse/YARN-2928 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be address. This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2762: - Attachment: YARN-2762.6.patch RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.2.patch, YARN-2762.3.patch, YARN-2762.4.patch, YARN-2762.5.patch, YARN-2762.6.patch, YARN-2762.patch All NodeLabel args validation's are done at server side. The same can be done at RMAdminCLI so that unnecessary RPC calls can be avoided. And for the input such as x,y,,z,, no need to add empty string instead can be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238176#comment-14238176 ] Rohith commented on YARN-2762: -- Updated the patch fixing review comment. Kindly review updated patch RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.2.patch, YARN-2762.3.patch, YARN-2762.4.patch, YARN-2762.5.patch, YARN-2762.6.patch, YARN-2762.patch All NodeLabel args validation's are done at server side. The same can be done at RMAdminCLI so that unnecessary RPC calls can be avoided. And for the input such as x,y,,z,, no need to add empty string instead can be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2284) Find missing config options in YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238183#comment-14238183 ] Ray Chiang commented on YARN-2284: -- Initial version of this fix needs all Configuration properties to exist within the .xml files. Find missing config options in YarnConfiguration and yarn-default.xml - Key: YARN-2284 URL: https://issues.apache.org/jira/browse/YARN-2284 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Attachments: YARN-2284-04.patch, YARN-2284-05.patch, YARN-2284-06.patch, YARN-2284-07.patch, YARN-2284-08.patch, YARN2284-01.patch, YARN2284-02.patch, YARN2284-03.patch YarnConfiguration has one set of properties. yarn-default.xml has another set of properties. Ideally, there should be an automatic way to find missing properties in either location. This is analogous to MAPREDUCE-5130, but for yarn-default.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
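One way to automate the comparison this JIRA describes is to reflect over the String constants in YarnConfiguration and check them against the keys loaded from yarn-default.xml. The sketch below does that crudely (prefix constants such as those ending in "." will show up as false positives, and yarn-default.xml is assumed to be on the classpath); it is not the attached patch.
{code}
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MissingPropertyScan {
  public static void main(String[] args) throws Exception {
    // Keys declared in yarn-default.xml.
    Configuration conf = new Configuration(false);
    conf.addResource("yarn-default.xml");
    Set<String> xmlKeys = new HashSet<String>();
    for (Map.Entry<String, String> entry : conf) {
      xmlKeys.add(entry.getKey());
    }

    // Public static String constants in YarnConfiguration that look like keys.
    for (Field field : YarnConfiguration.class.getFields()) {
      if (Modifier.isStatic(field.getModifiers()) && field.getType() == String.class) {
        String value = (String) field.get(null);
        if (value != null && value.startsWith("yarn.") && !xmlKeys.contains(value)) {
          System.out.println("In YarnConfiguration but not in yarn-default.xml: " + value);
        }
      }
    }
  }
}
{code}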
[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238184#comment-14238184 ] Ray Chiang commented on YARN-2910: -- Never mind, I misread the earlier JIRA history. I must have accidentally clicked on Assign to me while scrolling around. With the newest unit test and *without* the code fix (i.e. expecting failures), I'm seeing a failure rate around 70%. I think it would still be a good idea to increase the modifications to get the failure rate higher (as Tsuyoshi suggested earlier). I can get 10/10 failures with a value of 400 in the modify for loop. FSLeafQueue can throw ConcurrentModificationException - Key: YARN-2910 URL: https://issues.apache.org/jira/browse/YARN-2910 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch The list that maintains the runnable and the non runnable apps are a standard ArrayList but there is no guarantee that it will only be manipulated by one thread in the system. This can lead to the following exception: {noformat} 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) at java.util.ArrayList$Itr.next(ArrayList.java:831) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) {noformat} Full stack trace in the attached file. We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2284) Find missing config options in YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2284: - Attachment: YARN-2284-09.patch Updated for latest Configuration variables. Find missing config options in YarnConfiguration and yarn-default.xml - Key: YARN-2284 URL: https://issues.apache.org/jira/browse/YARN-2284 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Attachments: YARN-2284-04.patch, YARN-2284-05.patch, YARN-2284-06.patch, YARN-2284-07.patch, YARN-2284-08.patch, YARN-2284-09.patch, YARN2284-01.patch, YARN2284-02.patch, YARN2284-03.patch YarnConfiguration has one set of properties. yarn-default.xml has another set of properties. Ideally, there should be an automatic way to find missing properties in either location. This is analogous to MAPREDUCE-5130, but for yarn-default.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2927) InMemorySCMStore properties are inconsistent
[ https://issues.apache.org/jira/browse/YARN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238211#comment-14238211 ] Chris Trezzo commented on YARN-2927: Thanks [~rchiang] for the fix! InMemorySCMStore properties are inconsistent Key: YARN-2927 URL: https://issues.apache.org/jira/browse/YARN-2927 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Ray Chiang Assignee: Ray Chiang Labels: newbie, supportability Fix For: 2.7.0 Attachments: YARN-2927.001.patch, YARN-2927.002.patch I see these properties in the yarn-default.xml file: yarn.sharedcache.store.in-memory.check-period-mins yarn.sharedcache.store.in-memory.initial-delay-mins yarn.sharedcache.store.in-memory.staleness-period-mins YarnConfiguration looks like it's missing some properties: public static final String SHARED_CACHE_PREFIX = yarn.sharedcache.; public static final String SCM_STORE_PREFIX = SHARED_CACHE_PREFIX + store.; public static final String IN_MEMORY_STORE_PREFIX = SHARED_CACHE_PREFIX + in-memory.; public static final String IN_MEMORY_STALENESS_PERIOD_MINS = IN_MEMORY_STORE_PREFIX + staleness-period-mins; It looks like the definition for IN_MEMORY_STORE_PREFIX should be: public static final String IN_MEMORY_STORE_PREFIX = SCM_STORE_PREFIX + in-memory.; Just to be clear, there are properties that exist in yarn-default.xml that are effectively misspelled in the *Java* file, not the .xml file. This is similar to YARN-2461 and MAPREDUCE-6087. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238237#comment-14238237 ] Hadoop QA commented on YARN-2762: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685790/YARN-2762.6.patch against trunk revision 144da2e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 31 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6037//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6037//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6037//console This message is automatically generated. RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.2.patch, YARN-2762.3.patch, YARN-2762.4.patch, YARN-2762.5.patch, YARN-2762.6.patch, YARN-2762.patch All NodeLabel args validation's are done at server side. The same can be done at RMAdminCLI so that unnecessary RPC calls can be avoided. And for the input such as x,y,,z,, no need to add empty string instead can be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238255#comment-14238255 ] Mit Desai commented on YARN-2900: - [~zjshen], from the changes that the patch makes, only time that the NotFound is thrown is when there is no application|attempt|container that the client is asking for. I am not sure why the timelineserver throws some exception and we get a NotFound on the browser. Can you explain what was the test that you did here? Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2618: -- Attachment: YARN-2618-2.patch Thanks, [~kasha]. Updated a new patch to fix the comments. The existing patch works well with FairScheduler. But for FifoScheduler and CapacityScheduler, it cannot avoid over-allocating disk resources. This is because both Fifo and Capacity only consider memory capacity when assigning containers to nodes, and support over-consuming of CPU resources. [~jianhe], do you know any special reason why CapacityScheduler supports over-consuming CPU resources? Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2683) registry config options: document and move to core-default
[ https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238303#comment-14238303 ] Hadoop QA commented on YARN-2683: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685776/HADOOP-10530-005.patch against trunk revision 57cb43b. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6039//console This message is automatically generated. registry config options: document and move to core-default -- Key: YARN-2683 URL: https://issues.apache.org/jira/browse/YARN-2683 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-10530-005.patch, YARN-2683-001.patch, YARN-2683-002.patch, YARN-2683-003.patch Original Estimate: 1h Remaining Estimate: 1h Add to {{yarn-site}} a page on registry configuration parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2892) Unable to get AMRMToken in unmanaged AM when using a secure cluster
[ https://issues.apache.org/jira/browse/YARN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238307#comment-14238307 ] Sevada Abraamyan commented on YARN-2892: I don't see this either. However, one thing I did notice is that with the patch we are now changing how ClientToAMToken is constructed, as we are using the short name instead of the full name.
{code}
@Override
public ApplicationReport createAndGetApplicationReport(String clientUserName, boolean allowAccess) {
  if (UserGroupInformation.isSecurityEnabled()) {
    // get a token so the client can communicate with the app attempt
    // NOTE: token may be unavailable if the attempt is not running
    Token<ClientToAMTokenIdentifier> attemptClientToAMToken =
        this.currentAttempt.createClientToken(clientUserName);
    if (attemptClientToAMToken != null) {
      clientToAMToken = BuilderUtils.newClientToAMToken(
          attemptClientToAMToken.getIdentifier(),
          attemptClientToAMToken.getKind().toString(),
          attemptClientToAMToken.getPassword(),
          attemptClientToAMToken.getService().toString());
    }
    ...
{code}
Unable to get AMRMToken in unmanaged AM when using a secure cluster --- Key: YARN-2892 URL: https://issues.apache.org/jira/browse/YARN-2892 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Sevada Abraamyan Assignee: Sevada Abraamyan Attachments: YARN-2892.patch, YARN-2892.patch, YARN-2892.patch An AMRMToken is retrieved from the ApplicationReport by the YarnClient. When the RM creates the ApplicationReport and sends it back to the client it makes a simple security check whether it should include the AMRMToken in the report (See createAndGetApplicationReport in RMAppImpl). This security check verifies that the user who submitted the original application is the same user who is requesting the ApplicationReport. If they are indeed the same user then it includes the AMRMToken, otherwise it does not include it. The problem arises from the fact that when an application is submitted, the RM saves the short username of the user who created the application (See submitApplication in ClientRmService). Afterwards when the ApplicationReport is requested, the system tries to match the full username of the requester against the previously stored short username. In a secure cluster using Kerberos this check fails because the realm is stripped from the principal when we request a short username. So for example the short username might be Foo whereas the full username is f...@company.com Note: A very similar problem has been previously reported ([Yarn-2232|https://issues.apache.org/jira/browse/YARN-2232]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-405) Add command start-up time to environment of a container to track launch costs
[ https://issues.apache.org/jira/browse/YARN-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah resolved YARN-405. -- Resolution: Not a Problem Add command start-up time to environment of a container to track launch costs - Key: YARN-405 URL: https://issues.apache.org/jira/browse/YARN-405 Project: Hadoop YARN Issue Type: Improvement Reporter: Hitesh Shah Assignee: Hitesh Shah Priority: Minor Labels: container Attachments: YARN-405.1.patch For applications like MapReduce, jvm launch cost has always been considered a factor in performance. Adding some basic information into the environment will allow an application to track its startup costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-510) Writing Yarn Applications documentation should be changed to signify use of fully qualified paths when localizing resources
[ https://issues.apache.org/jira/browse/YARN-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-510: - Assignee: (was: Hitesh Shah) Writing Yarn Applications documentation should be changed to signify use of fully qualified paths when localizing resources -- Key: YARN-510 URL: https://issues.apache.org/jira/browse/YARN-510 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.0.0-alpha Reporter: Hitesh Shah Path jarPath = new Path("/Working_HDFS_DIR/" + appId + "/" + AM_JAR); fs.copyFromLocalFile(new Path("/local/src/AM.jar"), jarPath); // VALIDATED jar is in HDFS under correct PATH FileStatus jarStatus = fs.getFileStatus(jarPath); LocalResource amJarRsrc = Records.newRecord(LocalResource.class); amJarRsrc.setType(LocalResourceType.FILE); amJarRsrc.setVisibility(LocalResourceVisibility.APPLICATION); amJarRsrc.setResource(ConverterUtils.getYarnUrlFromPath(jarPath)); amJarRsrc.setTimestamp(jarStatus.getModificationTime()); amJarRsrc.setSize(jarStatus.getLen()); localResources.put("AppMaster.jar", amJarRsrc); amContainer.setLocalResources(localResources); Error logs (nodeManager.log) INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1364219323374_0016 transitioned from INITING to RUNNING INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Got exception parsing AppMaster.jar and value resource {, port: -1, file: /Working_HDFS_DIR/application_1364219323374_0016/AM.jar, }, size: 13940, timestamp: 1364230436600, type: FILE, visibility: APPLICATION, 2013-03-25 17:53:57,391 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Failed to parse resource-request java.net.URISyntaxException: Expected scheme name at index 0: :///Working_HDFS_DIR/application_1364219323374_0016/AM.jar at java.net.URI$Parser.fail(URI.java:2810) at java.net.URI$Parser.failExpecting(URI.java:2816) at java.net.URI$Parser.parse(URI.java:3008) at java.net.URI.<init>(URI.java:735) at org.apache.hadoop.yarn.util.ConverterUtils.getPathFromYarnURL(ConverterUtils.java:70) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:501) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:472) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:382) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMa -- This message was sent by Atlassian JIRA (v6.3.4#6332)
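The failure above comes from a Path built without a scheme, which serializes to ":///..." and cannot be parsed on the NodeManager. Below is a hedged sketch of the fully qualified variant the documentation should show, using the same record calls as the description with the path qualified first; it is an illustration, not text from the documentation.
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class QualifiedAmJar {
  public static LocalResource amJarResource(FileSystem fs, Path jarPath) throws IOException {
    // makeQualified adds scheme and authority (e.g. hdfs://namenode:8020/...),
    // so the URL the NM receives parses cleanly.
    Path qualified = fs.makeQualified(jarPath);
    FileStatus jarStatus = fs.getFileStatus(qualified);
    LocalResource amJarRsrc = Records.newRecord(LocalResource.class);
    amJarRsrc.setType(LocalResourceType.FILE);
    amJarRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
    amJarRsrc.setResource(ConverterUtils.getYarnUrlFromPath(qualified));
    amJarRsrc.setTimestamp(jarStatus.getModificationTime());
    amJarRsrc.setSize(jarStatus.getLen());
    return amJarRsrc;
  }
}
{code}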
[jira] [Updated] (YARN-436) Document how to use DistributedShell yarn application
[ https://issues.apache.org/jira/browse/YARN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-436: - Assignee: (was: Hitesh Shah) Document how to use DistributedShell yarn application - Key: YARN-436 URL: https://issues.apache.org/jira/browse/YARN-436 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2284) Find missing config options in YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238332#comment-14238332 ] Hadoop QA commented on YARN-2284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685803/YARN-2284-09.patch against trunk revision ffe942b. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 74 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6038//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6038//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6038//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6038//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6038//console This message is automatically generated. Find missing config options in YarnConfiguration and yarn-default.xml - Key: YARN-2284 URL: https://issues.apache.org/jira/browse/YARN-2284 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Attachments: YARN-2284-04.patch, YARN-2284-05.patch, YARN-2284-06.patch, YARN-2284-07.patch, YARN-2284-08.patch, YARN-2284-09.patch, YARN2284-01.patch, YARN2284-02.patch, YARN2284-03.patch YarnConfiguration has one set of properties. yarn-default.xml has another set of properties. Ideally, there should be an automatic way to find missing properties in either location. This is analogous to MAPREDUCE-5130, but for yarn-default.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238336#comment-14238336 ] Hadoop QA commented on YARN-2495: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685787/YARN-2495.20141208-1.patch against trunk revision 144da2e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 38 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6036//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6036//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6036//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6036//console This message is automatically generated. Allow admin specify labels from each NM (Distributed configuration) --- Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, YARN-2495.20141119-1.patch, YARN-2495.20141126-1.patch, YARN-2495.20141204-1.patch, YARN-2495.20141208-1.patch, YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml (YARN-2923) or using script suggested by [~aw] (YARN-2729) ) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2284) Find missing config options in YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238356#comment-14238356 ] Ray Chiang commented on YARN-2284: -- RE: Findbugs. I don't see any of the findbugs warnings in the code added/deleted by this patch. Find missing config options in YarnConfiguration and yarn-default.xml - Key: YARN-2284 URL: https://issues.apache.org/jira/browse/YARN-2284 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: supportability Attachments: YARN-2284-04.patch, YARN-2284-05.patch, YARN-2284-06.patch, YARN-2284-07.patch, YARN-2284-08.patch, YARN-2284-09.patch, YARN2284-01.patch, YARN2284-02.patch, YARN2284-03.patch YarnConfiguration has one set of properties. yarn-default.xml has another set of properties. Ideally, there should be an automatic way to find missing properties in either location. This is analogous to MAPREDUCE-5130, but for yarn-default.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238357#comment-14238357 ] Wangda Tan commented on YARN-2762: -- Looks good, thanks for the update. The test failure shouldn't be related, but could you take a look at the findbugs warnings? RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.2.patch, YARN-2762.3.patch, YARN-2762.4.patch, YARN-2762.5.patch, YARN-2762.6.patch, YARN-2762.patch All NodeLabel args validation's are done at server side. The same can be done at RMAdminCLI so that unnecessary RPC calls can be avoided. And for the input such as x,y,,z,, no need to add empty string instead can be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2571: - Attachment: YARN-2571-009.patch patch -009 in sync with trunk RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch, YARN-2571-005.patch, YARN-2571-007.patch, YARN-2571-008.patch, YARN-2571-009.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2925) Internal fields in LeafQueue access should be protected when accessed from FiCaSchedulerApp to calculate Headroom
[ https://issues.apache.org/jira/browse/YARN-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238365#comment-14238365 ] Wangda Tan commented on YARN-2925: -- [~cwelch], Thanks for your comments; your suggestion makes sense to me. I will: - Drop the existing QueueResourceInfo implementation and do the refactoring in a future patch - Add a fine-grained lock only for headroom computing, to resolve both the consistency and staleness issues; it will cover the user's consumed resource and the queue's used resource. I suggest using a read/write lock to achieve better performance. Any thoughts? Will work on a patch later. Thanks, Internal fields in LeafQueue access should be protected when accessed from FiCaSchedulerApp to calculate Headroom - Key: YARN-2925 URL: https://issues.apache.org/jira/browse/YARN-2925 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2925.1.patch With YARN-2644, FiCaScheduler will calculate up-to-date headroom before sending back the Allocation response to the AM. Headroom calculation happens on the LeafQueue side and uses fields like used resource, etc. But it is not protected by any lock of LeafQueue, so it might be corrupted if someone else is editing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
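A minimal sketch of the fine-grained read/write lock idea described above, with hypothetical names and resources reduced to megabytes for brevity (this is not the YARN-2925 patch): writers update the queue's used and per-user consumed figures under the write lock, and headroom computation takes the read lock so it sees a consistent snapshot without holding the whole LeafQueue lock.
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

class HeadroomState {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private long queueUsedMB;
  private long userConsumedMB;
  private long queueLimitMB;

  // Called by the scheduler when allocations/releases change queue usage.
  void update(long queueUsedMB, long userConsumedMB, long queueLimitMB) {
    lock.writeLock().lock();
    try {
      this.queueUsedMB = queueUsedMB;
      this.userConsumedMB = userConsumedMB;
      this.queueLimitMB = queueLimitMB;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Called from FiCaSchedulerApp when computing headroom for an allocate response.
  long computeHeadroomMB() {
    lock.readLock().lock();
    try {
      return Math.max(0, queueLimitMB - Math.max(queueUsedMB, userConsumedMB));
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}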
[jira] [Created] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
Anubhav Dhoot created YARN-2931: --- Summary: PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Instead we can have PublicLocalizer not depend on this and also call getInitializedLocalDirs so it can handle initialization on its own similar to non public localization -- This message was sent by Atlassian JIRA (v6.3.4#6332)
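A rough illustration of the proposed direction (hypothetical method placement, not the eventual patch): have the public localizer ensure the local directories are initialized before it submits a download, rather than relying on a private LocalizerRunner having triggered getInitializedLocalDirs first.
{code}
import java.util.Collections;
import java.util.List;

class PublicLocalizerSketch {
  void addResource(Object request) {
    // Ensure <local-dir>/filecache (and friends) exist with the right permissions even
    // when NM recovery skipped directory creation, instead of relying on a private
    // LocalizerRunner having initialized them first.
    getInitializedLocalDirs();
    // ... then submit the FSDownload callable for the public resource as before ...
  }

  // Stand-in for ResourceLocalizationService#getInitializedLocalDirs.
  List<String> getInitializedLocalDirs() {
    // ... mkdir/chmod of usercache, filecache and nmPrivate under each local dir ...
    return Collections.emptyList();
  }
}
{code}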
[jira] [Updated] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2931: Description: When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Instead we can have PublicLocalizer not depend on this and also call getInitializedLocalDirs so it can handle initialization on its own similar to non public localization Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} was: When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Instead we can have PublicLocalizer not depend on this and also call getInitializedLocalDirs so it can handle initialization on its own similar to non public localization PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. 
This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Instead we can have PublicLocalizer not depend on this and also call getInitializedLocalDirs so it can handle initialization on its own similar to non public localization Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at
[jira] [Assigned] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-2931: --- Assignee: Anubhav Dhoot PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Instead we can have PublicLocalizer not depend on this and also call getInitializedLocalDirs so it can handle initialization on its own similar to non public localization Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2931: Attachment: YARN-2931.001.patch Let PublicLocalizer also initialize the local directories similar to LocalizerRunner PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Instead we can have PublicLocalizer not depend on this and also call getInitializedLocalDirs so it can handle initialization on its own similar to non public localization Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238480#comment-14238480 ] Karthik Kambatla commented on YARN-2931: Initially in the description, from Anubhav: Instead we can have PublicLocalizer not depend on this and also call getInitializedLocalDirs so it can handle initialization on its own similar to non public localization PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2892) Unable to get AMRMToken in unmanaged AM when using a secure cluster
[ https://issues.apache.org/jira/browse/YARN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238520#comment-14238520 ] Sevada Abraamyan commented on YARN-2892: On second thought I think [~djp] was referring directly to the code I referenced above. Since we'd rather not modify the public interface of RMApp, maybe we should continue passing in the full username to _createAndGetApplicationReport_ and, prior to the AMRMToken security check, use this full username to construct a short username. It seems a bit hacky but I'm not sure how else we can avoid breaking the public interface. The easiest way I can see of doing this is by using something like the following: {code} UserGroupInformation remoteUser = UserGroupInformation.getRemoteUser(clientUserName); String shortUsername = remoteUser.getShortUsername(); {code} Another solution could be to do the following: {code} //if security is set to kerberos... HadoopKerberosName kbName = new HadoopKerberosName(clientUserName); String shortUsername = kbName.getShortUsername(); {code} The first solution is a bit strange but looks more attractive to me as it allows _RMAppImpl_ to stay agnostic to the underlying security framework. Any suggestions? Unable to get AMRMToken in unmanaged AM when using a secure cluster --- Key: YARN-2892 URL: https://issues.apache.org/jira/browse/YARN-2892 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Sevada Abraamyan Assignee: Sevada Abraamyan Attachments: YARN-2892.patch, YARN-2892.patch, YARN-2892.patch An AMRMToken is retrieved from the ApplicationReport by the YarnClient. When the RM creates the ApplicationReport and sends it back to the client it makes a simple security check whether it should include the AMRMToken in the report (See createAndGetApplicationReport in RMAppImpl). This security check verifies that the user who submitted the original application is the same user who is requesting the ApplicationReport. If they are indeed the same user then it includes the AMRMToken, otherwise it does not include it. The problem arises from the fact that when an application is submitted, the RM saves the short username of the user who created the application (See submitApplication in ClientRmService). Afterwards when the ApplicationReport is requested, the system tries to match the full username of the requester against the previously stored short username. In a secure cluster using Kerberos this check fails because the principal's realm is stripped when we request a short username. So for example the short username might be Foo whereas the full username is f...@company.com Note: A very similar problem has been previously reported ([Yarn-2232|https://issues.apache.org/jira/browse/YARN-2232]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
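For reference, a self-contained variant of the two options quoted above, using the method names as they exist in org.apache.hadoop.security to the best of my knowledge (UserGroupInformation.createRemoteUser, getShortUserName, and KerberosName#getShortName); treat it as a sketch, not the patch.
{code}
import java.io.IOException;

import org.apache.hadoop.security.HadoopKerberosName;
import org.apache.hadoop.security.UserGroupInformation;

public class ShortNameSketch {
  // Option 1: let UGI resolve the short name, keeping RMAppImpl agnostic of the
  // underlying security framework.
  static String viaUgi(String clientUserName) {
    UserGroupInformation remoteUser =
        UserGroupInformation.createRemoteUser(clientUserName);
    return remoteUser.getShortUserName();
  }

  // Option 2: apply the Kerberos auth_to_local rules directly. The rules must already
  // have been loaded (e.g. via HadoopKerberosName.setConfiguration), which is the case
  // inside an initialized RM.
  static String viaKerberosRules(String clientUserName) throws IOException {
    HadoopKerberosName kbName = new HadoopKerberosName(clientUserName);
    return kbName.getShortName();
  }
}
{code}
Either way, the short name derived from the requester's full principal can then be compared against the short name stored at submission time.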
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238529#comment-14238529 ] Hadoop QA commented on YARN-2931: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685852/YARN-2931.001.patch against trunk revision 6c5bbd7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1219 javac compiler warnings (more than the trunk's current 1217 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6040//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6040//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6040//console This message is automatically generated. PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. 
Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
Eric Payne created YARN-2932: Summary: Add entry for preemption setting to queue status screen and startup/refresh logging Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-2932: Assignee: Eric Payne Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2931: Attachment: YARN-2931.002.patch Fixed javac warnings PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2932) Add entry for preemption setting to queue status screen and startup/refresh logging
[ https://issues.apache.org/jira/browse/YARN-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238582#comment-14238582 ] Wangda Tan commented on YARN-2932: -- Thanks for raising this, [~eepayne]; it is a good addition. IIRC, YARN-2056 put the per-queue disable-preemption configuration code in ProportionalCapacityPreemptionPolicy instead of in CapacitySchedulerConfiguration. But after reading this proposal, I think we should move it to CapacitySchedulerConfiguration, and getIsPreemptionDisabled should be a method of the CSQueue interface. Thoughts? Add entry for preemption setting to queue status screen and startup/refresh logging --- Key: YARN-2932 URL: https://issues.apache.org/jira/browse/YARN-2932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 enables the ability to turn preemption on or off on a per-queue level. This JIRA will provide the preemption status for each queue in the {{HOST:8088/cluster/scheduler}} UI and in the RM log during startup/queue refresh. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
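To make the suggested refactoring concrete, a sketch with hypothetical signatures (the property name follows the yarn.scheduler.capacity.<queue-path>.* pattern and is an assumption, as is everything else here): the per-queue flag is read in CapacitySchedulerConfiguration-style code and exposed through the CSQueue interface so the web UI and the refresh logging can query it.
{code}
import org.apache.hadoop.conf.Configuration;

/** Hypothetical addition to the CSQueue interface. */
interface PreemptionAwareQueue {
  /** Whether preemption is disabled for this queue. */
  boolean getPreemptionDisabled();
}

/** Hypothetical accessor in the style of CapacitySchedulerConfiguration. */
class PreemptionConfigSketch {
  static final String PREFIX = "yarn.scheduler.capacity.";
  // Assumed suffix for the per-queue flag introduced by YARN-2056.
  static final String DISABLE_PREEMPTION_SUFFIX = ".disable_preemption";

  private final Configuration conf;

  PreemptionConfigSketch(Configuration conf) {
    this.conf = conf;
  }

  boolean getPreemptionDisabled(String queuePath, boolean defaultVal) {
    return conf.getBoolean(PREFIX + queuePath + DISABLE_PREEMPTION_SUFFIX, defaultVal);
  }
}
{code}
The queue status screen and the startup/refresh log line would then only need to call getPreemptionDisabled() on each queue.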
[jira] [Updated] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2900: Attachment: YARN-2900.patch Attaching the patch that addresses the NFE and indenting. I'll wait for your response on the IllegalStateException. Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2900: Attachment: YARN-2900.patch Refining the patch; missed removing an unused import. Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238619#comment-14238619 ] Zhijie Shen commented on YARN-2900: --- bq. I am not sure why the timelineserver throws some exception and we get a NotFound on the browser. Can you explain what was the test that you did here? What I did: 1. Start the timeline server while the system metrics publisher is enabled for the RM. 2. Submit an MR example job. 3. Type {{http://localhost:8188/ws/v1/applicationhistory/apps/application_1417818619773_0001?user.name=zshen}} in the browser, and check the output, which is correct. 4. Type {{http://localhost:8188/ws/v1/applicationhistory/apps/application_1417818619773_0002?user.name=zshen}}, and look for the NOT_FOUND message. However, there is no response at all, and I see the aforementioned exception in the timeline server log. I applied this patch on trunk, and I could reproduce this issue. Undoing this patch, {{http://localhost:8188/ws/v1/applicationhistory/apps/application_1417818619773_0002?user.name=zshen}} will return Internal Server Error (500), which is the expected current behavior. Did you have a chance to reproduce it on your side? Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
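A small probe of the behavior under discussion, for anyone reproducing it (illustrative only; the host, port, and application id below are taken from the steps above and should be substituted): it requests a non-existent app from the AHS REST API and prints the HTTP status, so a 404 versus the current 500 is immediately visible.
{code}
import java.net.HttpURLConnection;
import java.net.URL;

public class AhsStatusProbe {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://localhost:8188/ws/v1/applicationhistory/apps/"
        + "application_1417818619773_0002?user.name=zshen");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    // Expect 404 (NOT_FOUND) once the fix behaves as intended; 500 is the current behavior.
    System.out.println("HTTP status: " + conn.getResponseCode());
    conn.disconnect();
  }
}
{code}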
[jira] [Updated] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2837: -- Attachment: YARN-2837.5.patch The new patch does two more things: 1. Correct the logic of storing the version by differentiating between creating a new state store and loading an existing one. It seems that LeveldbTimelineStore needs to be fixed too; let's treat that as a separate issue. 2. Like RMDelegationTokenIdentifierData, create a TimelineDelegationTokenIdentifierData to wrap all fields to be serialized into leveldb, for better compatibility if we add more fields in the future. Timeline server needs to recover the timeline DT when restarting Key: YARN-2837 URL: https://issues.apache.org/jira/browse/YARN-2837 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2837.1.patch, YARN-2837.2.patch, YARN-2837.3.patch, YARN-2837.4.patch, YARN-2837.5.patch Timeline server needs to recover the stateful information when restarting as RM/NM/JHS does now. So far the stateful information only includes the timeline DT. Without recovery, the timeline DT of the existing YARN apps is not long valid, and cannot be renewed any more after the timeline server is restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
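To illustrate point 2, a rough sketch of the "wrap all fields in one record" idea (names and field layout are assumptions, not the actual TimelineDelegationTokenIdentifierData in the patch): keeping the serialized token identifier and its renew date together in a single Writable makes it easier to add fields later without changing the leveldb key scheme.
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

class TimelineDtRecordSketch implements Writable {
  // Serialized TimelineDelegationTokenIdentifier bytes plus its renew date.
  private byte[] tokenIdentifier = new byte[0];
  private long renewDate;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(tokenIdentifier.length);
    out.write(tokenIdentifier);
    out.writeLong(renewDate);
    // New fields would be appended here in later versions.
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int len = in.readInt();
    tokenIdentifier = new byte[len];
    in.readFully(tokenIdentifier);
    renewDate = in.readLong();
  }
}
{code}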
[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)
[ https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238675#comment-14238675 ] Hadoop QA commented on YARN-2900: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685879/YARN-2900.patch against trunk revision ddffcd8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6041//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6041//console This message is automatically generated. Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500) --- Key: YARN-2900 URL: https://issues.apache.org/jira/browse/YARN-2900 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Mit Desai Attachments: YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222) at org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) at org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218) ... 59 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2920) CapacityScheduler should be notified when labels on nodes changed
[ https://issues.apache.org/jira/browse/YARN-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2920: - Attachment: YARN-2920.2.patch Updated patch: dropped some unnecessary refactoring code which would cause a deadlock (tracked by YARN-2925). Resolved UT failures. CapacityScheduler should be notified when labels on nodes changed - Key: YARN-2920 URL: https://issues.apache.org/jira/browse/YARN-2920 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2920.1.patch, YARN-2920.2.patch Currently, changes to labels on nodes are only handled by RMNodeLabelsManager, but that is not enough when labels on nodes change: - Scheduler should be able to take actions on running containers (like kill/preempt/do nothing) - Used / available capacity in the scheduler should be updated for future planning. We need to add a new event to pass such updates to the scheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238683#comment-14238683 ] Hadoop QA commented on YARN-2931: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685860/YARN-2931.002.patch against trunk revision ddffcd8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 7 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6042//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6042//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6042//console This message is automatically generated. PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. 
Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA
[jira] [Commented] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238684#comment-14238684 ] Hadoop QA commented on YARN-2837: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685889/YARN-2837.5.patch against trunk revision ddffcd8. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6043//console This message is automatically generated. Timeline server needs to recover the timeline DT when restarting Key: YARN-2837 URL: https://issues.apache.org/jira/browse/YARN-2837 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2837.1.patch, YARN-2837.2.patch, YARN-2837.3.patch, YARN-2837.4.patch, YARN-2837.5.patch Timeline server needs to recover the stateful information when restarting as RM/NM/JHS does now. So far the stateful information only includes the timeline DT. Without recovery, the timeline DT of the existing YARN apps is not long valid, and cannot be renewed any more after the timeline server is restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238704#comment-14238704 ] Wangda Tan commented on YARN-2618: -- [~ywskycn], the Capacity Scheduler already supports multi-dimension resources via DominantResourceCalculator, and it should work once DRC is updated to support disk. The following statement is not true: bq. This is because both Fifo and Capacity only care memory capacity when assigning containers to nodes See {{CapacitySchedulerConfiguration.getResourceCalculator}}. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
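For reference, switching the Capacity Scheduler to DRC is a one-property change; the property name below is the one read by CapacitySchedulerConfiguration.getResourceCalculator as far as I know, but double-check it against your Hadoop version before relying on it.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class EnableDrcSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Usually set in capacity-scheduler.xml; shown programmatically here for brevity.
    conf.set("yarn.scheduler.capacity.resource-calculator",
        "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator");
    System.out.println(conf.get("yarn.scheduler.capacity.resource-calculator"));
  }
}
{code}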
[jira] [Updated] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2837: -- Attachment: (was: YARN-2837.5.patch) Timeline server needs to recover the timeline DT when restarting Key: YARN-2837 URL: https://issues.apache.org/jira/browse/YARN-2837 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2837.1.patch, YARN-2837.2.patch, YARN-2837.3.patch, YARN-2837.4.patch Timeline server needs to recover the stateful information when restarting as RM/NM/JHS does now. So far the stateful information only includes the timeline DT. Without recovery, the timeline DT of the existing YARN apps is not long valid, and cannot be renewed any more after the timeline server is restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2837: -- Attachment: YARN-2837.5.patch Timeline server needs to recover the timeline DT when restarting Key: YARN-2837 URL: https://issues.apache.org/jira/browse/YARN-2837 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2837.1.patch, YARN-2837.2.patch, YARN-2837.3.patch, YARN-2837.4.patch, YARN-2837.5.patch Timeline server needs to recover the stateful information when restarting as RM/NM/JHS does now. So far the stateful information only includes the timeline DT. Without recovery, the timeline DT of the existing YARN apps is not long valid, and cannot be renewed any more after the timeline server is restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2931: Attachment: YARN-2931.002.patch PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238725#comment-14238725 ] Anubhav Dhoot commented on YARN-2931: - The findbugs warnings do not seem related to the patch. Uploading again to retrigger PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238734#comment-14238734 ] bc Wong commented on YARN-2931: --- Thanks for the fix! Some nits. ResourceLocalizationService.java * Instead of commenting out code, I would just remove it. TestResourceLocalizationService.java * L950: Remove the code that is commented out. PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
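The stack trace in the report fails inside FSDownload#createDir because the NM's public filecache directory was wiped but never recreated after recovery. As a rough, standalone sketch of the direction the patch under review takes (this is not the actual ResourceLocalizationService code; ensurePublicCacheDir is a made-up name), the idea is for the public localizer to (re)initialize the target local directory before the download is handed to the executor, the same way getInitializedLocalDirs already does for private localization:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Illustrative sketch only. The real change lives in ResourceLocalizationService;
 * this just shows the "make sure the directory exists before downloading" idea.
 */
public class EnsurePublicCacheDir {

  /** Hypothetical stand-in for initializing the public filecache dir on demand. */
  static Path ensurePublicCacheDir(Path localDir) throws IOException {
    Path filecache = localDir.resolve("filecache");
    if (!Files.isDirectory(filecache)) {
      // The data dir was cleaned up before NM recovery; recreate it so
      // FSDownload#createDir does not hit FileNotFoundException.
      Files.createDirectories(filecache);
    }
    return filecache;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Public cache ready at "
        + ensurePublicCacheDir(Paths.get("/tmp/yarn-nm-local")));
  }
}
{code}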
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238753#comment-14238753 ] Wei Yan commented on YARN-2618: --- Thanks for pointing that out, [~leftnoteasy]. I'll check that and update the testcases for Capacity. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2618-1.patch, YARN-2618-2.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the third resource type. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238763#comment-14238763 ] Hadoop QA commented on YARN-2837: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685897/YARN-2837.5.patch against trunk revision ddffcd8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 7 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6045//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6045//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6045//console This message is automatically generated. Timeline server needs to recover the timeline DT when restarting Key: YARN-2837 URL: https://issues.apache.org/jira/browse/YARN-2837 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2837.1.patch, YARN-2837.2.patch, YARN-2837.3.patch, YARN-2837.4.patch, YARN-2837.5.patch Timeline server needs to recover the stateful information when restarting, as the RM/NM/JHS do now. So far the stateful information only includes the timeline DT. Without recovery, the timeline DT of the existing YARN apps is no longer valid, and cannot be renewed any more after the timeline server is restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
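To make the recovery requirement above concrete, here is a hedged, standalone sketch of the general idea rather than the actual YARN-2837 patch (SimpleTimelineTokenStore and its file format are hypothetical): persist issued timeline delegation-token identifiers with their renew dates, and reload them on startup so existing tokens remain valid and renewable after a restart.
{code}
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only; the real patch uses YARN's own state-store machinery. */
public class SimpleTimelineTokenStore {
  private final Path storeFile;
  private final Map<String, Long> tokenRenewDates = new HashMap<String, Long>();

  public SimpleTimelineTokenStore(Path storeFile) {
    this.storeFile = storeFile;
  }

  /** Persist a token's renew date so it survives a restart. */
  public synchronized void storeToken(String tokenId, long renewDate) throws IOException {
    tokenRenewDates.put(tokenId, renewDate);
    try (BufferedWriter w = Files.newBufferedWriter(storeFile, StandardCharsets.UTF_8)) {
      for (Map.Entry<String, Long> e : tokenRenewDates.entrySet()) {
        w.write(e.getKey() + "\t" + e.getValue());
        w.newLine();
      }
    }
  }

  /** Reload previously issued tokens when the timeline server restarts. */
  public synchronized void recover() throws IOException {
    if (!Files.exists(storeFile)) {
      return; // nothing to recover on first start
    }
    for (String line : Files.readAllLines(storeFile, StandardCharsets.UTF_8)) {
      String[] parts = line.split("\t");
      tokenRenewDates.put(parts[0], Long.parseLong(parts[1]));
    }
  }
}
{code}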
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238770#comment-14238770 ] Hadoop QA commented on YARN-2931: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685898/YARN-2931.002.patch against trunk revision ddffcd8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6046//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6046//console This message is automatically generated. PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. 
Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2920) CapacityScheduler should be notified when labels on nodes changed
[ https://issues.apache.org/jira/browse/YARN-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238784#comment-14238784 ] Hadoop QA commented on YARN-2920: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685892/YARN-2920.2.patch against trunk revision ddffcd8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1218 javac compiler warnings (more than the trunk's current 1217 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 2 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6044//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6044//artifact/patchprocess/patchReleaseAuditProblems.txt Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6044//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6044//console This message is automatically generated. CapacityScheduler should be notified when labels on nodes changed - Key: YARN-2920 URL: https://issues.apache.org/jira/browse/YARN-2920 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2920.1.patch, YARN-2920.2.patch Currently, changes to labels on nodes are handled only by RMNodeLabelsManager, but that is not enough when labels on nodes change: - The scheduler should be able to take actions on running containers (like kill/preempt/do-nothing). - Used / available capacity in the scheduler should be updated for future planning. We need to add a new event to pass such updates to the scheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
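The "new event" mentioned in the description could look roughly like the sketch below; the class name and fields are hypothetical and not the shape actually added by the YARN-2920 patch. On receiving such an event, the scheduler could act on running containers (kill/preempt/do nothing) and refresh used/available capacity for the affected labels.
{code}
import java.util.Collections;
import java.util.Set;

/** Hypothetical event carrying a node's updated label set to the scheduler. */
public class NodeLabelsUpdateSchedulerEvent {
  private final String nodeId;
  private final Set<String> updatedLabels;

  public NodeLabelsUpdateSchedulerEvent(String nodeId, Set<String> updatedLabels) {
    this.nodeId = nodeId;
    this.updatedLabels = Collections.unmodifiableSet(updatedLabels);
  }

  public String getNodeId() {
    return nodeId;
  }

  /** Labels now associated with the node; the scheduler updates its view from this. */
  public Set<String> getUpdatedLabels() {
    return updatedLabels;
  }
}
{code}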
[jira] [Assigned] (YARN-2930) TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8
[ https://issues.apache.org/jira/browse/YARN-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2930: Assignee: Wangda Tan (was: Rohith) TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8 Key: YARN-2930 URL: https://issues.apache.org/jira/browse/YARN-2930 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Wangda Tan Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/31/console : {code} testRMRestartRecoveringNodeLabelManager[0](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.136 sec FAILURE! java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) testRMRestartRecoveringNodeLabelManager[1](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.081 sec FAILURE! java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2930) TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8
[ https://issues.apache.org/jira/browse/YARN-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238818#comment-14238818 ] Wangda Tan commented on YARN-2930: -- [~rohithsharma], I've looked into this and found the root cause, so I took over since it causes other Jenkins job failures. It is caused by some other test(s) writing node labels to the FS, and TestRMRestart.testRMRestartRecoveringNodeLabelManager loading the previously written node label store from the FS when starting. I've done a patch that will allocate a random temp directory for writing node labels data, and will clean it up when the JVM exits. Please let me know your comments. Thanks, Wangda TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8 Key: YARN-2930 URL: https://issues.apache.org/jira/browse/YARN-2930 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Rohith Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/31/console : {code} testRMRestartRecoveringNodeLabelManager[0](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.136 sec FAILURE! java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) testRMRestartRecoveringNodeLabelManager[1](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.081 sec FAILURE! java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
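A minimal sketch of the test-isolation idea described above (illustrative only; the class name and configuration wiring are not taken from the actual patch): give each test run its own random temp directory for the node-label store and clean it up when the JVM exits, so labels persisted by one test cannot be recovered by another.
{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

/** Illustrative only; shows the random-temp-dir-per-run idea, not the real patch. */
public class RandomNodeLabelDir {

  static File createIsolatedNodeLabelDir() throws IOException {
    File dir = Files.createTempDirectory("node-labels-").toFile();
    // Best-effort cleanup when the JVM exits; a test can also delete it in tearDown().
    dir.deleteOnExit();
    return dir;
  }

  public static void main(String[] args) throws IOException {
    File labelDir = createIsolatedNodeLabelDir();
    // A test would point the node-label FS store root-dir setting at this path
    // before starting the RM, so nothing is read from a shared location.
    System.out.println("Node label store isolated at " + labelDir);
  }
}
{code}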
[jira] [Updated] (YARN-2930) TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8
[ https://issues.apache.org/jira/browse/YARN-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2930: - Attachment: YARN-2930.1.patch TestRMRestart#testRMRestartRecoveringNodeLabelManager sometimes fails against Java 8 Key: YARN-2930 URL: https://issues.apache.org/jira/browse/YARN-2930 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Wangda Tan Priority: Minor Attachments: YARN-2930.1.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/31/console : {code} testRMRestartRecoveringNodeLabelManager[0](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.136 sec FAILURE! java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) testRMRestartRecoveringNodeLabelManager[1](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 0.081 sec FAILURE! java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartRecoveringNodeLabelManager(TestRMRestart.java:2100) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238838#comment-14238838 ] Karthik Kambatla commented on YARN-2910: Here is the deadlock Wilfred was mentioning: {noformat} FairSchedulerContinuousScheduling: at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:553) - waiting to lock 0x0007f6bc8f58 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:769) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:228) - locked 0x0007f6b5ec00 (a java.util.Collections$SynchronizedRandomAccessList) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1072) - locked 0x0007f68f25e8 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:280) Thread-434: at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:152) - waiting to lock 0x0007f6b5ec00 (a java.util.Collections$SynchronizedRandomAccessList) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) - locked 0x0007f6bc8f58 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:939) - locked 0x0007f6bc8f58 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3509) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) {noformat} FSLeafQueue can throw ConcurrentModificationException - Key: YARN-2910 URL: https://issues.apache.org/jira/browse/YARN-2910 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch The list that maintains the runnable and the non runnable apps are a standard ArrayList but there is no guarantee that it will only be 
manipulated by one thread in the system. This can lead to the following exception: {noformat} 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) at java.util.ArrayList$Itr.next(ArrayList.java:831) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) {noformat} Full stack trace in the attached file. We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
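The two stacks in the comment above acquire the same pair of locks in opposite order: the continuous-scheduling thread holds the queue's synchronized app list and waits for the FSAppAttempt monitor, while the allocate() thread holds the FSAppAttempt monitor and waits for the list. A minimal standalone illustration of that inversion follows (not YARN code; running it will simply hang with both threads blocked):
{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Demonstrates the lock-ordering deadlock pattern from the trace above. */
public class LockOrderDeadlockDemo {
  private static final List<Object> runnableApps =
      Collections.synchronizedList(new ArrayList<Object>());
  private static final Object appAttempt = new Object();

  public static void main(String[] args) {
    // Continuous-scheduling thread: list monitor first, then the app attempt.
    new Thread(new Runnable() {
      public void run() {
        synchronized (runnableApps) {
          pause();
          synchronized (appAttempt) { /* assignContainer */ }
        }
      }
    }).start();

    // allocate() thread: app attempt first, then the list monitor for iteration.
    new Thread(new Runnable() {
      public void run() {
        synchronized (appAttempt) {
          pause();
          synchronized (runnableApps) { /* getResourceUsage */ }
        }
      }
    }).start();
  }

  private static void pause() {
    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
  }
}
{code}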
[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238884#comment-14238884 ] Karthik Kambatla commented on YARN-2910: Looking around, we don't need the synchronization for FSAppAttempt#getHeadroom. That and changing the locking to use read-write locks should get us a long way towards avoiding this situation. Also, if we are locking on each access, we should be able to drop the use of Collections.synchronizedList. FSLeafQueue can throw ConcurrentModificationException - Key: YARN-2910 URL: https://issues.apache.org/jira/browse/YARN-2910 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: FSLeafQueue_concurrent_exception.txt, YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, YARN-2910.4.patch, YARN-2910.patch The list that maintains the runnable and the non runnable apps are a standard ArrayList but there is no guarantee that it will only be manipulated by one thread in the system. This can lead to the following exception: {noformat} 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) at java.util.ArrayList$Itr.next(ArrayList.java:831) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) {noformat} Full stack trace in the attached file. We should guard against that by using a thread safe version from java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
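A rough sketch of the direction suggested above, not the committed patch: replace Collections.synchronizedList with an explicit read-write lock around the app lists, so readers such as getResourceUsage() work under a read lock and never hold the list monitor that participates in the deadlock, while writers still get exclusive access.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: read-write-lock guarding of a runnable-apps list. */
public class ReadWriteGuardedApps<T> {
  private final List<T> runnableApps = new ArrayList<T>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  /** Writers (add/remove app) take the write lock for exclusive access. */
  public void add(T app) {
    lock.writeLock().lock();
    try {
      runnableApps.add(app);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Readers copy under the read lock, so concurrent iteration cannot throw CME. */
  public List<T> snapshot() {
    lock.readLock().lock();
    try {
      return new ArrayList<T>(runnableApps);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}
Multiple readers can hold the read lock at once, so aggregation calls such as getResourceUsage() no longer serialize against each other and no longer need the app-attempt monitor at all.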
[jira] [Updated] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2931: Attachment: YARN-2931.003.patch Addressed comments and made the test more robust in verifying the fix. PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch, YARN-2931.003.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2762) RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM
[ https://issues.apache.org/jira/browse/YARN-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238902#comment-14238902 ] Rohith commented on YARN-2762: -- Findbugs warnings have been generated for all of the syserr and sysout usages, including in other class files as well. I suspect some Findbugs rule has been modified? RMAdminCLI node-labels-related args should be trimmed and checked before sending to RM -- Key: YARN-2762 URL: https://issues.apache.org/jira/browse/YARN-2762 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Minor Attachments: YARN-2762.1.patch, YARN-2762.2.patch, YARN-2762.2.patch, YARN-2762.3.patch, YARN-2762.4.patch, YARN-2762.5.patch, YARN-2762.6.patch, YARN-2762.patch All NodeLabel args validations are done at the server side. The same can be done at RMAdminCLI so that unnecessary RPC calls can be avoided. And for input such as x,y,,z,, there is no need to add an empty string; it can instead be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2917) Potential deadlock in AsyncDispatcher when system.exit called in AsyncDispatcher#dispatch and AsyscDispatcher#serviceStop from shutdown hook
[ https://issues.apache.org/jira/browse/YARN-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238905#comment-14238905 ] Rohith commented on YARN-2917: -- Hi [~kasha], [~jianhe], [~vinodkv] Kindly review the analysis and patch. This issue is causing the RM to hang. Potential deadlock in AsyncDispatcher when system.exit called in AsyncDispatcher#dispatch and AsyscDispatcher#serviceStop from shutdown hook Key: YARN-2917 URL: https://issues.apache.org/jira/browse/YARN-2917 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2917.patch I encountered a scenario where the RM hung while shutting down and kept on logging {{2014-12-03 19:32:44,283 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Waiting for AsyncDispatcher to drain.}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
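The hang described above can be modeled in a few lines (not YARN code, just a standalone sketch of the reported interaction): the dispatcher thread calls System.exit() from dispatch(), which blocks while shutdown hooks run; the shutdown hook's serviceStop() waits for the event queue to drain, but the only thread that could drain it is the one stuck inside exit(), so the "Waiting for AsyncDispatcher to drain." line repeats forever. Running this sketch reproduces the symptom by hanging.
{code}
/** Minimal model of the AsyncDispatcher shutdown deadlock; hangs by design. */
public class ShutdownHookDrainHang {
  private static volatile boolean drained = false;

  public static void main(String[] args) {
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      public void run() {
        // Stand-in for serviceStop(): wait for the dispatcher to drain its events.
        while (!drained) {
          System.out.println("Waiting for AsyncDispatcher to drain.");
          pause();
        }
      }
    }));

    new Thread(new Runnable() {
      public void run() {
        // Stand-in for dispatch(): a fatal event triggers System.exit(), which
        // never returns because the hook above waits on work only this thread could do.
        System.exit(1);
      }
    }).start();
  }

  private static void pause() {
    try { Thread.sleep(500); } catch (InterruptedException ignored) { }
  }
}
{code}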
[jira] [Updated] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2931: Attachment: YARN-2931.004.patch PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch, YARN-2931.003.patch, YARN-2931.004.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2931: --- Priority: Critical (was: Major) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch, YARN-2931.003.patch, YARN-2931.004.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238919#comment-14238919 ] Karthik Kambatla commented on YARN-2931: +1, pending Jenkins. PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner -- Key: YARN-2931 URL: https://issues.apache.org/jira/browse/YARN-2931 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2931.001.patch, YARN-2931.002.patch, YARN-2931.002.patch, YARN-2931.003.patch, YARN-2931.004.patch When the data directory is cleaned up and NM is started with existing recovery state, because of YARN-90, it will not recreate the local dirs. This causes a PublicLocalizer to fail until getInitializedLocalDirs is called due to some LocalizeRunner for private localization. Example error {noformat} 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs:/blah machine:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, 1417589819618, FILE, null },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-02 22:57:32,629 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1417589109512_0001_02_03 transitioned from LOCALIZING to LOCALIZATION_FAILED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)