[jira] [Commented] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
[ https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998269#comment-13998269 ] Karthik Kambatla commented on YARN-2062: I propose having a dummy invalid transition in RMNodeImpl to capture all the invalid transitions. We can just log these at DEBUG level. Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover --- Key: YARN-2062 URL: https://issues.apache.org/jira/browse/YARN-2062 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla On busy clusters, we see several {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events invoked against NEW nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
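Whatever form the catch-all takes, the end result is the same: the invalid transition is noted at DEBUG rather than surfacing as a noisy exception on every failover. The fragment below is only a sketch of that idea, using a plain try/catch around doTransition as a simpler stand-in for the dummy-transition approach Karthik proposes; the field names (stateMachine, nodeId, LOG) follow the usual YARN state-machine pattern and are assumptions, not the actual patch.
{code}
// Sketch only: demote invalid-transition reporting to DEBUG.
try {
  stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Can't handle event " + event.getType() + " at current state "
        + getState() + " for node " + nodeId, e);
  }
}
{code}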
[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996444#comment-13996444 ] Maysam Yabandeh commented on YARN-1969: --- [~kkambatl], you are right. The title of the jira is misleading. The jira description talks about jobs that are about to finish and their estimated end time, but the title indicates a deadline. I guess the confusion came from the name of the earliest-deadline-first algorithm cited in the jira description. What we had in mind was a variation of the algorithm that (a) takes other parameters into account and (b) is not necessarily tied to a deadline. Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling*, however, they have a low priority since there are other jobs (usually much smaller newcomers) that are using resources well below their fair share; hence newly released containers are not offered to the big, yet close-to-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resources to the big job, since the sooner the big job finishes, the sooner it releases its many allocated resources for use by other jobs. In other words, what we require is a variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and the estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the scheduling priority would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource request messages. To be less susceptible to apps gaming the system, we can limit this scheduling to *only within a queue*: i.e., add an EarliestDeadlinePolicy that extends SchedulingPolicy and let queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
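To make the p(MEM, TIME) idea above concrete, here is a rough, purely illustrative comparator that could order applications inside a queue; the AppInfo holder, the field names, and the weighting formula are assumptions for the sketch, not a proposed default.
{code}
// Illustrative only: rank large, close-to-finishing apps first.
// A smaller score means a higher scheduling priority.
class EarliestFinishComparator implements java.util.Comparator<AppInfo> {
  @Override
  public int compare(AppInfo a, AppInfo b) {
    double scoreA = a.estimatedMinutesToFinish / Math.max(1.0, a.allocatedMemGb);
    double scoreB = b.estimatedMinutesToFinish / Math.max(1.0, b.allocatedMemGb);
    return Double.compare(scoreA, scoreB);
  }
}

// Hypothetical holder for the two inputs the description calls MEM and TIME.
class AppInfo {
  double allocatedMemGb;           // MEM: memory currently allocated to the app
  double estimatedMinutesToFinish; // TIME: e.g. from TaskRuntimeEstimator#estimatedRuntime
}
{code}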
[jira] [Commented] (YARN-2011) Fix typo and warning in TestLeafQueue
[ https://issues.apache.org/jira/browse/YARN-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998226#comment-13998226 ] Hudson commented on YARN-2011: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1753 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1753/]) YARN-2011. Fix typo and warning in TestLeafQueue (Contributed by Chen He) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593804) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java Fix typo and warning in TestLeafQueue - Key: YARN-2011 URL: https://issues.apache.org/jira/browse/YARN-2011 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Chen He Assignee: Chen He Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2011-v2.patch, YARN-2011.patch a.assignContainers(clusterResource, node_0); assertEquals(2*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G // Again one to user_0 since he hasn't exceeded user limit yet a.assignContainers(clusterResource, node_0); assertEquals(3*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(1*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998231#comment-13998231 ] Hudson commented on YARN-2016: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1753 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1753/]) YARN-2016. Fix a bug in GetApplicationsRequestPBImpl to add the missed fields to proto. Contributed by Junping Du (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594085) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2053: - Attachment: YARN-2053.patch Attached a new patch with UT according to [~jianhe]'s suggestion. Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Attachments: YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-766) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk
[ https://issues.apache.org/jira/browse/YARN-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-766: Summary: TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk (was: TestNodeManagerShutdown should use Shell to form the output path) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk Key: YARN-766 URL: https://issues.apache.org/jira/browse/YARN-766 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: YARN-766.branch-2.txt, YARN-766.trunk.txt, YARN-766.txt File scriptFile = new File(tmpDir, "scriptFile.sh"); should be replaced with File scriptFile = Shell.appendScriptExtension(tmpDir, "scriptFile"); to match trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993020#comment-13993020 ] Jason Lowe commented on YARN-2034: -- While updating it we may also want to clarify that it is a target retention size that only includes resources with PUBLIC and PRIVATE visibility and excludes resources with APPLICATION visibility. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Priority: Minor The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-893) Capacity scheduler allocates vcores to containers but does not report it in headroom
[ https://issues.apache.org/jira/browse/YARN-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998657#comment-13998657 ] Tsuyoshi OZAWA commented on YARN-893: - Thanks for updating the patch, [~kj-ki]. It looks much cleaner. Great job. One additional point: can we add unit tests for the utility methods in DefaultResourceCalculator? Capacity scheduler allocates vcores to containers but does not report it in headroom Key: YARN-893 URL: https://issues.apache.org/jira/browse/YARN-893 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta, 2.3.0 Reporter: Bikas Saha Assignee: Kenji Kikushima Attachments: YARN-893-2.patch, YARN-893.patch In non-DRF mode, it reports 0 vcores in the headroom but it allocates 1 vcore to containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998524#comment-13998524 ] Karthik Kambatla commented on YARN-2061: We assume that the log level is at least INFO, so we add the is*Enabled() guards only for TRACE- and DEBUG-level messages. Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
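Concretely, the guarded-logging pattern would look something like the following; this is only a sketch (commons-logging style), and the message text and variable names are placeholders rather than lines from the actual patch.
{code}
// Sketch: demote per-znode chatter to DEBUG and guard it so the string
// concatenation is skipped entirely when DEBUG is off.
if (LOG.isDebugEnabled()) {
  LOG.debug("Storing RMDelegationToken_" + sequenceNumber + " at " + nodePath);
}
{code}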
[jira] [Commented] (YARN-1957) ProportionalCapacitPreemptionPolicy handling of corner cases...
[ https://issues.apache.org/jira/browse/YARN-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999005#comment-13999005 ] Hudson commented on YARN-1957: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1957. Consider the max capacity of the queue when computing the ideal capacity for preemption. Contributed by Carlo Curino (cdouglas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594414) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java ProportionalCapacitPreemptionPolicy handling of corner cases... --- Key: YARN-1957 URL: https://issues.apache.org/jira/browse/YARN-1957 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler, preemption Fix For: 3.0.0, 2.5.0, 2.4.1 Attachments: YARN-1957.patch, YARN-1957.patch, YARN-1957_test.patch The current version of ProportionalCapacityPreemptionPolicy should be improved to deal with the following two scenarios: 1) when rebalancing over-capacity allocations, it potentially preempts without considering the maxCapacity constraints of a queue (i.e., preempting possibly more than strictly necessary) 2) a zero capacity queue is preempted even if there is no demand (coherent with old use of zero-capacity to disabled queues) The proposed patch fixes both issues, and introduce few new test cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998912#comment-13998912 ] Bikas Saha commented on YARN-2027: -- Yes. If strict node locality is needed then the rack should not be specified. If the rack is specified then it will allow relaxing locality up to the rack but no further. YARN ignores host-specific resource requests Key: YARN-2027 URL: https://issues.apache.org/jira/browse/YARN-2027 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.0 Environment: RHEL 6.1 YARN 2.4 Reporter: Chris Riccomini YARN appears to be ignoring host-level ContainerRequests. I am creating a container request with code that pretty closely mirrors the DistributedShell code:
{code}
protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
  info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
  val capability = Records.newRecord(classOf[Resource])
  val priority = Records.newRecord(classOf[Priority])
  priority.setPriority(0)
  capability.setMemory(memMb)
  capability.setVirtualCores(cpuCores)
  // Specifying a host in the String[] host parameter here seems to do nothing. Setting relaxLocality to false also doesn't help.
  (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
}
{code}
When I run this code with a specific host in the ContainerRequest, YARN does not honor the request. Instead, it puts the container on an arbitrary host. This appears to be true for both the FifoScheduler and the CapacityScheduler. Currently, we are running the CapacityScheduler with the following settings:
{noformat}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
    <description>Samza queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>Default queue user limit a percentage from 0.0 to 1.0.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in one rack which is 40.</description>
  </property>
</configuration>
{noformat}
Digging into the code a bit (props to [~jghoman] for finding this), we have a theory as to
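Reading Bikas's explanation above, a node-strict request would name only the host and turn off locality relaxation. The sketch below is a hedged illustration against the AMRMClient.ContainerRequest constructor as I understand it in 2.4; the host name and sizes are placeholders.
{code}
// Sketch: request a container on exactly one host and forbid the scheduler
// from relaxing locality to the rack or to an arbitrary node.
Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
Priority priority = Priority.newInstance(0);

AMRMClient.ContainerRequest request = new AMRMClient.ContainerRequest(
    capability,
    new String[] { "node17.example.com" }, // nodes: the only acceptable host (placeholder)
    null,                                  // racks: deliberately not specified
    priority,
    false);                                // relaxLocality = false -> node-strict
amClient.addContainerRequest(request);
{code}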
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998825#comment-13998825 ] Wangda Tan commented on YARN-2017: -- LGTM, +1 (non-binding). Please kick off a Jenkins build. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of the same code is repeated among schedulers, e.g. between FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2053: - Attachment: YARN-2053.patch Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1612) Change Fair Scheduler to not disable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998887#comment-13998887 ] Chen He commented on YARN-1612: --- ping Change Fair Scheduler to not disable delay scheduling by default Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998623#comment-13998623 ] Tsuyoshi OZAWA commented on YARN-2061: -- The logging in removeRMDelegationTokenState()/updateRMDelegationTokenAndSequenceNumberInternal()/removeRMDTMasterKeyState() can be at TRACE and DEBUG levels.
{code}
LOG.info("Done Loading applications from ZK state store");
{code}
About this log, how about moving it to the tail of loadRMAppState()? Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup
[ https://issues.apache.org/jira/browse/YARN-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994001#comment-13994001 ] Hadoop QA commented on YARN-2036: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644162/YARN2036-02.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3729//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3729//console This message is automatically generated. Document yarn.resourcemanager.hostname in ClusterSetup -- Key: YARN-2036 URL: https://issues.apache.org/jira/browse/YARN-2036 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Fix For: 2.5.0 Attachments: YARN2036-01.patch, YARN2036-02.patch ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people should just be able to use that directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999139#comment-13999139 ] Hadoop QA commented on YARN-1365: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645051/YARN-1365.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3746//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3746//console This message is automatically generated. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2053: -- Attachment: YARN-2053.patch Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999094#comment-13999094 ] Jian He commented on YARN-2065: --- Looking at the exception posted in SLIDER-34, the problem is that the AM can get new containers from the RM, but cannot launch them on the NM because of the following method. The token is generated with the previous attempt's id (taken from the container) instead of the current attemptId, and the NM checks the attemptId from the NMToken against the attemptId from the container.
{code}
public NMToken createAndGetNMToken(String applicationSubmitter,
    ApplicationAttemptId appAttemptId, Container container) {
  try {
    this.readLock.lock();
    HashSet<NodeId> nodeSet = this.appAttemptToNodeKeyMap.get(appAttemptId);
    NMToken nmToken = null;
    if (nodeSet != null) {
      if (!nodeSet.contains(container.getNodeId())) {
        LOG.info("Sending NMToken for nodeId : " + container.getNodeId()
            + " for container : " + container.getId());
        Token token = createNMToken(
            container.getId().getApplicationAttemptId(), // <-- attempt id comes from the container
            container.getNodeId(), applicationSubmitter);
        nmToken = NMToken.newInstance(container.getNodeId(), token);
        nodeSet.add(container.getNodeId());
      }
    }
    return nmToken;
  } finally {
    this.readLock.unlock();
  }
}
{code}
Changing this method will fix this problem. But another problem is that ContainerManagerImpl#authorizeGetAndStopContainerRequest also requires the previous NMToken to talk to the previous container and the current NMToken to talk to the current container. Luckily, it currently does not throw an exception but just logs error messages. We also need to change the NM side to check against the applicationId rather than the attemptId. AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably - it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
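Per the analysis above, the fix would be to mint the token for the attempt that is actually registering rather than the attempt id embedded in the container. A hedged sketch of the changed call (not the committed patch):
{code}
// Sketch: use the current attempt's id (the appAttemptId parameter) instead of
// container.getId().getApplicationAttemptId(), which may point at the previous,
// now-finished attempt.
Token token = createNMToken(appAttemptId,
    container.getNodeId(), applicationSubmitter);
nmToken = NMToken.newInstance(container.getNodeId(), token);
{code}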
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999258#comment-13999258 ] Tsuyoshi OZAWA commented on YARN-1514: -- I'll make these parameters configurable: 1. number of applications 2. number of application attempts 3. ZK connection configuration (host:port) A result message with the WIP patch is as follows: {quote} ZKRMStateStore takes 12644 msec to loadState. {quote} Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.5.0 Attachments: YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations, as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover, so its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
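A rough sketch of what such a benchmark driver could look like, timing RMStateStore#loadState against a pre-populated store; createZKStore() and populate() are hypothetical helpers standing in for whatever the WIP patch actually does.
{code}
// Sketch only: time loadState() after writing numApps * attemptsPerApp records.
public static void main(String[] args) throws Exception {
  int numApps = Integer.parseInt(args[0]);        // 1. number of applications
  int attemptsPerApp = Integer.parseInt(args[1]); // 2. number of application attempts
  String zkHostPort = args[2];                    // 3. ZK connection (host:port)

  RMStateStore store = createZKStore(zkHostPort); // hypothetical: build and start a ZKRMStateStore
  populate(store, numApps, attemptsPerApp);       // hypothetical: write synthetic app/attempt state

  long start = System.currentTimeMillis();
  store.loadState();                              // the call being benchmarked
  long elapsedMs = System.currentTimeMillis() - start;
  System.out.println("ZKRMStateStore takes " + elapsedMs + " msec to loadState.");
}
{code}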
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998941#comment-13998941 ] Mayank Bansal commented on YARN-2055: - YARN-2022 is about avoiding killing the AM; however, this issue is more about how we launch the AM after preemption. There can be situations where a queue gets some capacity for one heartbeat, that capacity is then reclaimed by the other queue, the AM is killed again, and the job ends up failing. Based on the comments on YARN-2022, I don't see that case being handled there. Thanks, Mayank Preemption: Jobs are failing due to AMs are getting launched and killed multiple times -- Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal If Queue A does not have enough capacity to run the AM, the AM will borrow capacity from Queue B. In that case the AM will be killed when Queue B reclaims its capacity; the AM will then be launched and killed again, and the job will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998992#comment-13998992 ] zhihai xu commented on YARN-1569: - Hi, I want to work on this issue (YARN-1569). Can someone assign it to me? Thanks, zhihai For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Priority: Minor Labels: newbie As per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't check so far (no bug there now) but should be improved to match FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions
[ https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998195#comment-13998195 ] Hudson commented on YARN-1987: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1779 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1779/]) YARN-1987. Wrapper for leveldb DBIterator to aid in handling database exceptions. (Jason Lowe via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593757) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/LeveldbIterator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils/TestLeveldbIterator.java Wrapper for leveldb DBIterator to aid in handling database exceptions - Key: YARN-1987 URL: https://issues.apache.org/jira/browse/YARN-1987 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1987.patch, YARN-1987v2.patch Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998704#comment-13998704 ] Steve Loughran commented on YARN-941: - We've been doing AM restart and have already seen some token renewal problems - these may be worth fixing first: SLIDER-46 and SLIDER-34, with YARN-side issues YARN-2065 and YARN-2053. Fixing those probably comes before working on a patch here. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999451#comment-13999451 ] Anubhav Dhoot commented on YARN-1366: - Seems like we are going with no resync API for now, as per the current patch. I think it's a good idea to hold off on the new API unless we see a need; I don't feel there is a strong case for it yet. There are a few issues I see which will need a little more work. Pending releases - the AM forgets about a release request once it is made. We will have to reissue a release request after RM restart to be safe (and also make sure the RM can handle a duplicate of that). Otherwise we have a resource leak if the RM had not issued the release before it restarted. One way is to remember all releases in a new Set<ContainerId> pendingReleases in RMContainerRequestor and remove them by processing getCompletedContainersStatuses in makeRemoteRequest or a new function that it calls.
{code}
+    blacklistAdditions.addAll(blacklistedNodes);
{code}
Blacklisting has logic in ignoreBlacklisting to ignore it if we cross a threshold. So we can do:
{code}
if (!ignoreBlacklisting.get()) {
  blacklistAdditions.addAll(blacklistedNodes);
}
{code}
There are a few places where the line exceeds 80 chars. Otherwise it looks good. Let's add some tests and validate this. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
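A hedged sketch of the pendingReleases bookkeeping described above; the field and method placement in RMContainerRequestor, and the resendReleases() helper, are assumptions for illustration only.
{code}
// Sketch: remember every container we have asked the RM to release, and only
// forget it once the RM reports that container as completed. This lets the AM
// re-send outstanding releases after an RM restart/resync.
private final Set<ContainerId> pendingReleases = new HashSet<ContainerId>();

void release(ContainerId containerId) {
  pendingReleases.add(containerId);
  // ... also add it to the release list of the next allocate request ...
}

void onContainersCompleted(List<ContainerStatus> completed) {
  for (ContainerStatus status : completed) {
    // Safe even if this completion was never one of our release requests.
    pendingReleases.remove(status.getContainerId());
  }
}

void onResync() {
  // After RM restart, re-issue everything still pending; the RM must treat
  // duplicate release requests as idempotent.
  resendReleases(pendingReleases); // hypothetical helper
}
{code}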
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999689#comment-13999689 ] Tsuyoshi OZAWA commented on YARN-1474: -- [~kkambatl], can you check a latest patch and kick the Jenkins? I have no permission to kick the Jenkins. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999724#comment-13999724 ] Sunil G commented on YARN-2055: --- Thank you Mayank for the clarification. I have a small doubt here: in such scenarios, should the scheduler stop assigning any more containers to Queue A? Assuming Queue B has demand, only Queue B's requests should be served first. Is that correct? Preemption: Jobs are failing due to AMs are getting launched and killed multiple times -- Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal If Queue A does not have enough capacity to run the AM, the AM will borrow capacity from Queue B. In that case the AM will be killed when Queue B reclaims its capacity; the AM will then be launched and killed again, and the job will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999771#comment-13999771 ] Karthik Kambatla commented on YARN-2054: bq. If we want these configs to match up with yarn.resourcemanager.zk-timeout-ms and (as YARN-1878 is trying) if that can change, we need to somehow make them linked dynamically? These configs need not match, but in an HA setting, it might not make a lot of sense to have them significantly different. bq. Does it make sense to link with the config HA enabled also ? If we have another RM sitting standby, we may want to failover quickly. But if we have only one RM, and somehow ZK is unavailable, RM will only retry for 10 seconds and shuts down. Good point. Maybe we can come up with a good value for the retry interval based on whether HA is enabled and yarn.resourcemanager.zk-timeout-ms. Poor defaults for YARN ZK configs for retries and retry-inteval --- Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2054-1.patch Currently, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulative 1000 seconds before the RM gives up trying to connect to ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
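One hedged way of realizing that suggestion is sketched below; the property names are the ones already discussed on this jira, the fallback literals are placeholders, and the HA-based derivation is only the idea from the comment above, not committed behavior.
{code}
// Sketch: with HA, spread the retries over roughly one ZK session timeout so a
// standby RM can take over quickly; a lone RM keeps the configured interval.
long zkSessionTimeoutMs = conf.getLong("yarn.resourcemanager.zk-timeout-ms", 10000);
int numRetries = conf.getInt("yarn.resourcemanager.zk-num-retries", 500);

long retryIntervalMs;
if (HAUtil.isHAEnabled(conf)) {
  retryIntervalMs = Math.max(1, zkSessionTimeoutMs / numRetries);
} else {
  retryIntervalMs = conf.getLong("yarn.resourcemanager.zk-retry-interval-ms", 2000);
}
{code}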
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999558#comment-13999558 ] Junping Du commented on YARN-1338: -- Hi [~jlowe], thanks for contributing a patch here. It looks like the latest patch includes some code from YARN-1987, which is already committed. Would you mind updating it so that I can start reviewing and commenting? Thanks! Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch Today when the node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work-preserving restart we definitely want them, as running containers are using them * Even for non-work-preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2068) FairScheduler uses the same ResourceCalculator for all policies
Karthik Kambatla created YARN-2068: -- Summary: FairScheduler uses the same ResourceCalculator for all policies Key: YARN-2068 URL: https://issues.apache.org/jira/browse/YARN-2068 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla FairScheduler uses the same ResourceCalculator for all policies including DRF. Need to fix that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999173#comment-13999173 ] Hadoop QA commented on YARN-2054: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644753/yarn-2054-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3747//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3747//console This message is automatically generated. Poor defaults for YARN ZK configs for retries and retry-inteval --- Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2054-1.patch Currenly, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulate 1000 seconds before the RM gives up trying to connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998625#comment-13998625 ] Tsuyoshi OZAWA commented on YARN-1365: -- Oops, this comment is for YARN-1367. I'll comment it on YARN-1367. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999233#comment-13999233 ] Hadoop QA commented on YARN-2017: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645077/YARN-2017.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1277 javac compiler warnings (more than the trunk's current 1276 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3748//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3748//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3748//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3748//console This message is automatically generated. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1751) Improve MiniYarnCluster for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999003#comment-13999003 ] Hudson commented on YARN-1751: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1751. Improve MiniYarnCluster for log aggregation testing. Contributed by Ming Ma (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594275) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Improve MiniYarnCluster for log aggregation testing --- Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Fix For: 3.0.0, 2.5.0 Attachments: YARN-1751-trunk.patch, YARN-1751.patch MiniYarnCluster specifies an individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster:
{code}
File remoteLogDir = new File(testWorkDir,
    MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index);
remoteLogDir.mkdir();
config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath());
{code}
In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
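For the LogCLIHelpers point, the change amounts to passing the helper's configuration through; a minimal, hedged sketch (how LogCLIHelpers actually obtains its conf is elided here):
{code}
// Sketch: use this tool's Configuration so remote-log-dir settings are honored,
// instead of the default picked up by the no-arg overload.
FileContext fc = FileContext.getFileContext(getConf()); // was: FileContext.getFileContext()
{code}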
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1343#comment-1343 ] Chris Riccomini commented on YARN-2027: --- K, feel free to close. I'm fairly sure that I tried a host with a null rack during testing and it didn't work, but it might have been on the FIFO scheduler. Either way, we've figured out a workaround to our problem, and [~zhiguohong] has verified functionality on a real cluster, so I'm OK with closing this ticket out. YARN ignores host-specific resource requests Key: YARN-2027 URL: https://issues.apache.org/jira/browse/YARN-2027 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.0 Environment: RHEL 6.1 YARN 2.4 Reporter: Chris Riccomini YARN appears to be ignoring host-level ContainerRequests. I am creating a container request with code that pretty closely mirrors the DistributedShell code:
{code}
protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
  info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
  val capability = Records.newRecord(classOf[Resource])
  val priority = Records.newRecord(classOf[Priority])
  priority.setPriority(0)
  capability.setMemory(memMb)
  capability.setVirtualCores(cpuCores)
  // Specifying a host in the String[] host parameter here seems to do nothing. Setting relaxLocality to false also doesn't help.
  (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
}
{code}
When I run this code with a specific host in the ContainerRequest, YARN does not honor the request. Instead, it puts the container on an arbitrary host. This appears to be true for both the FifoScheduler and the CapacityScheduler. Currently, we are running the CapacityScheduler with the following settings:
{noformat}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
    <description>Samza queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>Default queue user limit a percentage from 0.0 to 1.0.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in one rack which
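For context, a minimal sketch (not from this ticket's patches; the class and method names below are made up) of pinning a request to one host with the AMRMClient API discussed above: supply both the host and its rack, and set relaxLocality to false so the scheduler cannot fall back to an arbitrary node. Passing the rack explicitly also avoids the null-rack case mentioned in the comment.
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class HostLocalRequestSketch {
  public static void addHostLocalRequest(AMRMClient<ContainerRequest> amClient,
      String host, String rack, int memMb, int cpuCores) {
    Resource capability = Resource.newInstance(memMb, cpuCores);
    Priority priority = Priority.newInstance(0);
    // Ask for this specific host, name the rack that owns it, and keep the request
    // pinned to the host by disabling locality relaxation.
    ContainerRequest request = new ContainerRequest(capability,
        new String[] { host }, new String[] { rack }, priority, false);
    amClient.addContainerRequest(request);
  }
}
{code}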
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.2.patch Fix a bug in the previous patch: when creating the delegation token, we shouldn't use the current user to serve as the owner of the DT, because the current user is going to be the user of the timeline server itself. On the other hand, we also cannot use the remote user from AuthenticationFilter, because before passing through AuthenticationFilter the user is still not logged in, and the remote user from HttpServletRequest is going to be dr.who by default, given that the static user filter is applied before it. The right way is to get the user name from the authentication token, because at this point Kerberos authentication has already passed, and the authentication token's user name is actually the client's Kerberos principal, which is the right one to use. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998948#comment-13998948 ] Jian He commented on YARN-2053: --- LGTM, +1, submit the same patch to kick jenkins Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.wip.patch Attached a WIP patch. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.5.0 Attachments: YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover; therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
Steve Loughran created YARN-2065: Summary: AM cannot create new containers after restart-NM token from previous attempt used Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1424) RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active
[ https://issues.apache.org/jira/browse/YARN-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang reassigned YARN-1424: Assignee: Ray Chiang RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active Key: YARN-1424 URL: https://issues.apache.org/jira/browse/YARN-1424 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Sandy Ryza Assignee: Ray Chiang Priority: Minor Labels: newbie RMAppImpl has a DUMMY_APPLICATION_RESOURCE_USAGE_REPORT to return when the caller of createAndGetApplicationReport doesn't have access. RMAppAttemptImpl should have something similar for getApplicationResourceUsageReport. It also might make sense to put the dummy report into ApplicationResourceUsageReport and allow both to use it. A test would also be useful to verify that RMAppAttemptImpl#getApplicationResourceUsageReport doesn't return null if the scheduler doesn't have a report to return. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1569: Assignee: (was: Anubhav Dhoot) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Priority: Minor Labels: newbie As noted in http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't do this check so far (no bug there now), but it should be improved to match FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
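For illustration, a minimal sketch of the instanceof-before-cast pattern the checklist asks for, roughly mirroring what FairScheduler#handle already does; the class name and the addNode() helper below are hypothetical.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeAddedSchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent;

public class TypeCheckedHandleSketch {
  public void handle(SchedulerEvent event) {
    switch (event.getType()) {
    case NODE_ADDED:
      // verify the concrete event type before casting
      if (!(event instanceof NodeAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      addNode(((NodeAddedSchedulerEvent) event).getAddedRMNode());
      break;
    default:
      // the remaining event types would get the same instanceof guard
      break;
    }
  }

  private void addNode(RMNode node) {
    // scheduler-specific bookkeeping would go here
  }
}
{code}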
[jira] [Created] (YARN-2067) FairScheduler update/continuous-scheduling threads should start only after the scheduler is started
Karthik Kambatla created YARN-2067: -- Summary: FairScheduler update/continuous-scheduling threads should start only after the scheduler is started Key: YARN-2067 URL: https://issues.apache.org/jira/browse/YARN-2067 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
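The ticket has no description yet; as a hedged sketch of what the title suggests, the background threads would be started from serviceStart() rather than from the constructor or reinitialize(). Everything below (class name, fields, thread bodies) is hypothetical.
{code}
import org.apache.hadoop.service.AbstractService;

public class SchedulerThreadStartSketch extends AbstractService {
  private final boolean continuousSchedulingEnabled = true;
  private final Thread updateThread = new Thread(new Runnable() {
    @Override public void run() { /* periodic fair-share update loop would go here */ }
  }, "FairSchedulerUpdateThread");
  private final Thread schedulingThread = new Thread(new Runnable() {
    @Override public void run() { /* continuous-scheduling loop would go here */ }
  }, "ContinuousSchedulingThread");

  public SchedulerThreadStartSketch() {
    super(SchedulerThreadStartSketch.class.getName());
  }

  @Override
  protected void serviceStart() throws Exception {
    // Start the background threads only once the scheduler service itself starts,
    // instead of in the constructor/reinitialize path.
    updateThread.start();
    if (continuousSchedulingEnabled) {
      schedulingThread.start();
    }
    super.serviceStart();
  }
}
{code}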
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2056: -- Fix Version/s: (was: 2.1.0-beta) Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1550: Attachment: YARN-1550.001.patch Updated caolong's patch NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.patch Three steps: 1. debug at RMAppManager#submitApplication after the code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2. submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. go to page http://ip:50030/cluster/scheduler and find a 500 ERROR! the log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1338: - Attachment: YARN-1338v4.patch Updating patch to trunk. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo reassigned YARN-2066: - Assignee: Hong Zhiguo Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
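Presumably the fix is simply to read this.finish instead of start when setting the finish-time bounds; a sketch of that change, mirroring the snippet above (hedged: the attached patch itself is not reproduced here):
{code}
if (this.finish != null) {
  builder.setFinishBegin(this.finish.getMinimumLong());
  builder.setFinishEnd(this.finish.getMaximumLong());
}
{code}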
[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999018#comment-13999018 ] Hudson commented on YARN-1362: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1362. Distinguish between nodemanager shutdown for decommission vs shutdown for restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594421) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Attachment: YARN-1936.2.patch Uploaded a new patch: we shouldn't request the timeline DT when the timeline service is not enabled. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch TimelineClient should be able to talk to the timeline server with Kerberos authentication or a delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
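As a rough sketch of the guard described above (the class and helper method below are hypothetical; only the YarnConfiguration flag names are real), the client would check the standard enable switch before asking for a timeline delegation token:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineDtGuardSketch {
  public void maybeFetchTimelineToken(Configuration conf) {
    // Only request the timeline DT if the timeline service is actually enabled.
    if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
      getTimelineDelegationToken(conf);
    }
  }

  private void getTimelineDelegationToken(Configuration conf) {
    // would call the timeline client to obtain the delegation token
  }
}
{code}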
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998624#comment-13998624 ] Tsuyoshi OZAWA commented on YARN-2061: -- s/RACE/TRACE/ Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2061: - Attachment: YARN2061-01.patch Patch to move several LOG.info messages to LOG.debug. Cleans up messages a bit and adds some consistency to messages from the same method. Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN2061-01.patch ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
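For illustration, a small sketch of the downgrade being applied (the message text and class below are made up): a message moved from LOG.info to LOG.debug and guarded so the string concatenation is skipped when DEBUG is off.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ZkStoreLoggingSketch {
  private static final Log LOG = LogFactory.getLog(ZkStoreLoggingSketch.class);

  void logStoredPath(String appIdPath) {
    // formerly LOG.info(...); now only emitted at DEBUG level
    if (LOG.isDebugEnabled()) {
      LOG.debug("Storing info for app at: " + appIdPath);
    }
  }
}
{code}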
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.4.patch Same patch to kick Jenkins Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of the same code is repeated among the schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base class. -- This message was sent by Atlassian JIRA (v6.2#6252)
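As a rough sketch of the idea, not the attached patches: fields and bookkeeping duplicated between FiCaSchedulerNode and FSSchedulerNode could be hoisted into one abstract base. The class and method names below are illustrative only.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public abstract class CommonSchedulerNodeSketch {
  private final Resource availableResource = Resources.createResource(0, 0);
  private final Resource usedResource = Resources.createResource(0, 0);

  // shared bookkeeping that today lives in both FiCaSchedulerNode and FSSchedulerNode
  public synchronized void deductAvailableResource(Resource resource) {
    Resources.subtractFrom(availableResource, resource);
    Resources.addTo(usedResource, resource);
  }

  public synchronized void addAvailableResource(Resource resource) {
    Resources.addTo(availableResource, resource);
    Resources.subtractFrom(usedResource, resource);
  }
}
{code}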
[jira] [Assigned] (YARN-1799) Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff
[ https://issues.apache.org/jira/browse/YARN-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-1799: - Assignee: Sunil G Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff - Key: YARN-1799 URL: https://issues.apache.org/jira/browse/YARN-1799 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Sunil G Assignee: Sunil G LocalDirAllocator provides paths for all tasks for their local writes. It considers the good list of directories which are selected by the health-check mechanism in LocalDirsHandlerService. getLocalPathForWrite() considers whether the requested size can fit in the capacity of the last-accessed directory. If more tasks ask for a path from LocalDirAllocator, then it is possible that the allocation is done based on the current disk availability at that given time. But this path could have earlier been given to some other tasks to write to, and they may still be writing sequentially. It is better to also check for an upper cutoff on disk utilization. -- This message was sent by Atlassian JIRA (v6.2#6252)
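A hypothetical illustration of such a cutoff (the 90% threshold and all names below are made up, not from any patch): a local directory is skipped once its utilization crosses an upper bound, even if the requested size would still fit.
{code}
import java.io.File;

public class DirUtilizationCutoffSketch {
  static final double MAX_UTILIZATION = 0.90;

  static boolean underCutoff(String localDir) {
    File dir = new File(localDir);
    long total = dir.getTotalSpace();
    long usable = dir.getUsableSpace();
    if (total == 0) {
      return false;
    }
    // fraction of the disk already in use
    double utilization = (double) (total - usable) / total;
    return utilization < MAX_UTILIZATION;
  }
}
{code}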
[jira] [Commented] (YARN-1981) Nodemanager version is not updated when a node reconnects
[ https://issues.apache.org/jira/browse/YARN-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999021#comment-13999021 ] Hudson commented on YARN-1981: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1981. Nodemanager version is not updated when a node reconnects (Jason Lowe via jeagles) (jeagles: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594358) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java Nodemanager version is not updated when a node reconnects - Key: YARN-1981 URL: https://issues.apache.org/jira/browse/YARN-1981 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 3.0.0, 2.5.0 Attachments: YARN-1981.patch When a nodemanager is quickly restarted and happens to change versions during the restart (e.g.: rolling upgrade scenario) the NM version as reported by the RM is not updated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2070) DistributedShell publish unfriendly user information to the timeline server
Zhijie Shen created YARN-2070: - Summary: DistributedShell publish unfriendly user information to the timeline server Key: YARN-2070 URL: https://issues.apache.org/jira/browse/YARN-2070 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Priority: Minor Below is the code that uses the string form of the current user object as the user value. {code} entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser().toString()); {code} When we use Kerberos authentication, it's going to output the full name, such as zjshen/localhost@LOCALHOST (auth.KERBEROS). It is not user-friendly for searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2070: -- Summary: DistributedShell publishes unfriendly user information to the timeline server (was: DistributedShell publish unfriendly user information to the timeline server) DistributedShell publishes unfriendly user information to the timeline server - Key: YARN-2070 URL: https://issues.apache.org/jira/browse/YARN-2070 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Priority: Minor Below is the code that uses the string form of the current user object as the user value. {code} entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser().toString()); {code} When we use Kerberos authentication, it's going to output the full name, such as zjshen/localhost@LOCALHOST (auth.KERBEROS). It is not user-friendly for searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
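The suggested change, sketched against the snippet above (hedged: this is not a committed patch; entity is the TimelineEntity from that snippet):
{code}
entity.addPrimaryFilter("user",
    UserGroupInformation.getCurrentUser().getShortUserName());
{code}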
[jira] [Updated] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1996: --- Description: Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code} if [ -e $1 ] ; then echo ERROR Node decommmissioning via health script hack fi {code} In the current version patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unhealthy.drain.containers}}. More versatile policies are possible in the future work. Currently, the health state of a node is binary determined based on the disk checker and the health script ERROR outputs. However, we can as well interpret health script output similar to java logging levels (one of which is ERROR) such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complimented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. was: Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code} if [ -e $1 ] ; then echo ERROR Node decommmissioning via health script hack fi {code} In the current version patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in the future work. Currently, the health state of a node is binary determined based on the disk checker and the health script ERROR outputs. However, we can as well interpret health script output similar to java logging levels (one of which is ERROR) such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complimented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. Provide alternative policies for UNHEALTHY nodes. 
- Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health
[jira] [Updated] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1354: - Attachment: YARN-1354-v3.patch Updated patch now that YARN-1987 and YARN-1362 have been committed. Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch The set of active applications in the nodemanager context needs to be recovered for a work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1969) Fair Scheduler: Add policy for Earliest Endtime First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1969: --- Summary: Fair Scheduler: Add policy for Earliest Endtime First (was: Fair Scheduler: Add policy for Earliest Deadline First) Fair Scheduler: Add policy for Earliest Endtime First - Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs.In other words, what we require is a kind of variation of *Earliest Deadline First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to *only within a queue*: i.e., adding a EarliestDeadlinePolicy extends SchedulingPolicy and let the queues to use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1424) RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active
[ https://issues.apache.org/jira/browse/YARN-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-1424: - Attachment: YARN1424-01.patch First version of a potential patch. - Moves DUMMY_APPLICATION_RESOURCE_USAGE_REPORT RMAppImpl to RMServerUtils. Cannot move this to ApplicationResourceUsageReport, since it exists in the hadoop-yarn-api module as opposed to everything else being in the hadoop-yarn-server module. - Uses the reference in RMAppImpl and RMAppAttemptImpl. - No unit tests in this particular patch file. RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active Key: YARN-1424 URL: https://issues.apache.org/jira/browse/YARN-1424 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Sandy Ryza Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN1424-01.patch RMAppImpl has a DUMMY_APPLICATION_RESOURCE_USAGE_REPORT to return when the caller of createAndGetApplicationReport doesn't have access. RMAppAttemptImpl should have something similar for getApplicationResourceUsageReport. It also might make sense to put the dummy report into ApplicationResourceUsageReport and allow both to use it. A test would also be useful to verify that RMAppAttemptImpl#getApplicationResourceUsageReport doesn't return null if the scheduler doesn't have a report to return. -- This message was sent by Atlassian JIRA (v6.2#6252)
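As a hedged sketch of the precomputed zeroed report described above (field values and the class name are illustrative; the attached YARN1424-01.patch is not reproduced here):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.util.Records;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DummyUsageReportSketch {
  // computed once, returned whenever the attempt is not active
  public static final ApplicationResourceUsageReport DUMMY_REPORT = createDummy();

  private static ApplicationResourceUsageReport createDummy() {
    ApplicationResourceUsageReport report =
        Records.newRecord(ApplicationResourceUsageReport.class);
    report.setNumUsedContainers(0);
    report.setNumReservedContainers(0);
    report.setUsedResources(Resources.createResource(0, 0));
    report.setReservedResources(Resources.createResource(0, 0));
    report.setNeededResources(Resources.createResource(0, 0));
    return report;
  }
}
{code}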
[jira] [Updated] (YARN-1969) Fair Scheduler: Add policy for Earliest Endtime First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1969: --- Description: What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs.In other words, we need a relaxed version of *Earliest Endtime First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to leaf queues which have applications. was: What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs.In other words, what we require is a kind of variation of *Earliest Deadline First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to *only within a queue*: i.e., adding a EarliestDeadlinePolicy extends SchedulingPolicy and let the queues to use it by setting the schedulingPolicy field. Fair Scheduler: Add policy for Earliest Endtime First - Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. 
Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs. In other words, we need a relaxed version of *Earliest Endtime First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to the RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to leaf queues which have applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
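To make the p(MEM, TIME) idea concrete, here is a toy, self-contained sketch (every name and the particular formula are hypothetical, not from any patch): apps holding more resources and closer to finishing sort first.
{code}
import java.util.Comparator;

public class EarliestEndtimeFirstSketch {
  public static class AppInfo {
    final long allocatedMb;          // MEM: resources currently held
    final long estimatedMinutesLeft; // TIME: AM-supplied estimate to finish

    public AppInfo(long allocatedMb, long estimatedMinutesLeft) {
      this.allocatedMb = allocatedMb;
      this.estimatedMinutesLeft = estimatedMinutesLeft;
    }

    // one possible p(MEM, TIME): bigger means schedule sooner
    double priority() {
      return (double) allocatedMb / (estimatedMinutesLeft + 1);
    }
  }

  public static final Comparator<AppInfo> EARLIEST_ENDTIME_FIRST =
      new Comparator<AppInfo>() {
        @Override
        public int compare(AppInfo a, AppInfo b) {
          // descending priority: large, nearly finished apps come first
          return Double.compare(b.priority(), a.priority());
        }
      };
}
{code}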
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998911#comment-13998911 ] Ray Chiang commented on YARN-2061: -- One other observation. For the various LOG.info() statements in a catch block, should those be LOG.error() or does it make sense for those to stay LOG.info()? Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN2061-01.patch ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
Ted Yu created YARN-2066: Summary: Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-2066: -- Attachment: YARN-2066.patch Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.5.patch Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of the same code is repeated among the schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1962) Timeline server is enabled by default
[ https://issues.apache.org/jira/browse/YARN-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994006#comment-13994006 ] Jason Lowe commented on YARN-1962: -- +1 lgtm. Will commit this early next week to give [~zjshen] a chance to comment. Timeline server is enabled by default - Key: YARN-1962 URL: https://issues.apache.org/jira/browse/YARN-1962 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Attachments: YARN-1962.1.patch, YARN-1962.2.patch Since Timeline server is not matured and secured yet, enabling it by default might create some confusion. We were playing with 2.4.0 and found a lot of exceptions for distributed shell example related to connection refused error. Btw, we didn't run TS because it is not secured yet. Although it is possible to explicitly turn it off through yarn-site config. In my opinion, this extra change for this new service is not worthy at this point,. This JIRA is to turn it off by default. If there is an agreement, i can put a simple patch about this. {noformat} 14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.in14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. 
com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999111#comment-13999111 ] Jian He commented on YARN-2061: --- Hi Ray, thanks for cleaning it up. I think a reasonable way is to put info level in unusual condition which helps debugging in most cases, and debug level in usual condition which avoids excessive loggings. Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN2061-01.patch ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999020#comment-13999020 ] Hudson commented on YARN-1861: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was causing both RMs to be stuck in standby mode when automatic failover is enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594356) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.4.1 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998660#comment-13998660 ] Tsuyoshi OZAWA commented on YARN-1514: -- Rough design: 1. Launch ZKRMStateStore and initialize ZooKeeper. 2. Create znodes based on the given options. 3. Run loadState() and show how much time it takes. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.5.0 ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover; therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
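A rough sketch of step 3 of this design (hedged: the option parsing and znode population from steps 1-2 are omitted, the ZooKeeper address is assumed to be a locally running ensemble, and the class name is made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore;

public class ZKStoreLoadBenchmarkSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // assumes ZooKeeper is already running and znodes were populated in step 2
    conf.set(YarnConfiguration.RM_ZK_ADDRESS, "localhost:2181");
    ZKRMStateStore store = new ZKRMStateStore();
    store.init(conf);
    store.start();
    long begin = System.nanoTime();
    store.loadState(); // the call being benchmarked
    long elapsedMs = (System.nanoTime() - begin) / 1000000L;
    System.out.println("ZKRMStateStore#loadState took " + elapsedMs + " ms");
    store.stop();
  }
}
{code}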
[jira] [Updated] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1861: --- Attachment: yarn-1861-6.patch Updated new patch (yarn-1861-6.patch) to fix the nits. Also, the RM could transition to Standby and immediately transition back to Active - reduced the sleep between retries to 1 ms, and changed the assert after the loop to use the number of attempts. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-766) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk
[ https://issues.apache.org/jira/browse/YARN-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994129#comment-13994129 ] Junping Du commented on YARN-766: - Hi [~sseth], the patch against trunk makes sense to me. So I updated the name of the JIRA to mention the format inconsistency here. Will commit it shortly. TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk Key: YARN-766 URL: https://issues.apache.org/jira/browse/YARN-766 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: YARN-766.branch-2.txt, YARN-766.trunk.txt, YARN-766.txt File scriptFile = new File(tmpDir, "scriptFile.sh"); should be replaced with File scriptFile = Shell.appendScriptExtension(tmpDir, "scriptFile"); to match trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1937: -- Attachment: YARN-1937.3.patch I've tested the patch on a single node cluster, which seems to work fine generally. Fix one bug I've found in the new patch. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2070: -- Labels: newbie (was: ) DistributedShell publishes unfriendly user information to the timeline server - Key: YARN-2070 URL: https://issues.apache.org/jira/browse/YARN-2070 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Priority: Minor Labels: newbie Below is the code that uses the string form of the current user object as the user value. {code} entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser().toString()); {code} When we use Kerberos authentication, it's going to output the full name, such as zjshen/localhost@LOCALHOST (auth.KERBEROS). It is not user-friendly for searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999719#comment-13999719 ] Tsuyoshi OZAWA commented on YARN-1365: -- Sure! I'll check it. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1935) Security for timeline server
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1935: -- Attachment: Timeline_Kerberos_DT_ACLs.patch I created an uber patch which integrates the pieces I've done so far. With this patch the timeline server can work in a secure mode (except the generic history service part): 1. The timeline server can start and log in with a Kerberos principal and keytab; 2. A user who has either passed Kerberos authentication or obtained a timeline delegation token can get access to the timeline data; 3. With ACLs enabled, only the owner who published the timeline data can access the data. Folks who are interested in timeline security can play with the patch. Security for timeline server Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Zhijie Shen Attachments: Timeline_Kerberos_DT_ACLs.patch Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2034: -- Attachment: YARN-2034.patch Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Attachments: YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998592#comment-13998592 ] Tsuyoshi OZAWA commented on YARN-1365: -- I've read your code. The prototype includes the following changes: 1. Changed NodeManager's RegisterNodeManagerRequest to send a ContainerReport. 2. Added configuration for RM_WORK_PRESERVING_RECOVERY_ENABLED. 3. Added the cluster timestamp to the ContainerId. I think we should focus on having the NM resync with the RM when RM_WORK_PRESERVING_RECOVERY_ENABLED is set to true. Can you add the resync code (the ResourceManager-side code) to the patch? Also, in regard to the ContainerId format, let's discuss it on YARN-2052. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1986) In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE
[ https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999022#comment-13999022 ] Hudson commented on YARN-1986: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1986. In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594476) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE Key: YARN-1986 URL: https://issues.apache.org/jira/browse/YARN-1986 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jon Bringhurst Assignee: Hong Zhiguo Priority: Critical Fix For: 2.4.1 Attachments: YARN-1986-2.patch, YARN-1986-3.patch, YARN-1986-testcase.patch, YARN-1986.patch After upgrade from 2.2.0 to 2.4.0, NPE on first job start. -After RM was restarted, the job runs without a problem.- {noformat} 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:744) 19:11:13,443 INFO ResourceManager:604 - Exiting, bbye.. {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000434#comment-14000434 ] Hadoop QA commented on YARN-1550: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645277/YARN-1550.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3756//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3756//console This message is automatically generated. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication after the code
{code}
if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) {
  String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!";
  LOG.warn(message);
  throw RPCUtil.getRemoteException(message);
}
{code}
2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log:
{noformat}
2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
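A minimal hypothetical sketch (not the attached YARN-1550 patch) of the kind of null guard that would avoid the render NPE above: skip applications whose attempt data is not available yet when the scheduler page is rendered.
{code}
// Hypothetical sketch only, not YARN-1550.001.patch: guard against apps whose
// current attempt is not available yet when the fair scheduler page renders.
for (RMApp app : apps.values()) {
  RMAppAttempt attempt = app.getCurrentAppAttempt();
  if (attempt == null) {
    // app is registered but has no attempt yet; skip it rather than NPE
    continue;
  }
  // ... render the table row for this attempt ...
}
{code}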
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998631#comment-13998631 ] Tsuyoshi OZAWA commented on YARN-1367: -- Some comments on the patch: 1. Can you fix the indent?
{code}
+ public boolean isWorkPreservingRestartEnabled() { return
+ isWorkPreservingRestartEnabled;
+ }
{code}
{code}
+ if (!rmWorkPreservingRestartEnbaled)
+ {
+containerManager.cleanupContainersOnNMResync();
+ }
{code}
2. IMO, recovery.work-preserving-restart.enabled is more appropriate because this is one of the options under the RECOVERY_ENABLED namespace.
{code}
public static final String RM_WORK_PRESERVING_RECOVERY_ENABLED =
    RM_PREFIX + "work-preserving.recovery.enabled";
{code}
After restart NM should resync with the RM without killing containers - Key: YARN-1367 URL: https://issues.apache.org/jira/browse/YARN-1367 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1367.prototype.patch After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the containers and instead inform the RM about all currently running containers, including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
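For illustration, the quoted hunks reformatted as the comment asks, with the configuration key renamed along the lines suggested in point 2; this is only a sketch of the suggestion, not the committed naming, and the misspelled variable from the patch is corrected here for readability.
{code}
// Illustrative only: the quoted snippets with the indentation fixed and the
// key placed under the recovery namespace as suggested; not the final patch.
public boolean isWorkPreservingRestartEnabled() {
  return isWorkPreservingRestartEnabled;
}

if (!rmWorkPreservingRestartEnabled) {
  containerManager.cleanupContainersOnNMResync();
}

public static final String RM_WORK_PRESERVING_RECOVERY_ENABLED =
    RM_PREFIX + "recovery.work-preserving-restart.enabled";
{code}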
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998823#comment-13998823 ] Wangda Tan commented on YARN-1368: -- Sorry, I went to the wrong JIRA; please ignore the above comment :-/ Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993246#comment-13993246 ] Wangda Tan commented on YARN-2030: -- +1 for this idea; I think we should handle this neatly to avoid possible bugs. Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Currently, the logic to handle different store events in handleStoreEvent() is as follows:
{code}
if (event.getType().equals(RMStateStoreEventType.STORE_APP)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
    ...
  } else {
    ...
  }
  ...
  try {
    if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
      ...
    } else {
      ...
    }
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
}
...
} else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
  ...
} else {
  ...
}
}
{code}
This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this, even though there are no real state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
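As a rough sketch of the YARN-2030 proposal above (state names and transition classes are assumptions for illustration, not a committed design), the event dispatch could be registered with YARN's StateMachineFactory so each event type gets its own transition instead of a chain of if/else blocks.
{code}
// Rough illustrative sketch: one transition per event type. State and
// transition class names are assumptions, not the committed implementation.
private static final StateMachineFactory<RMStateStore,
    RMStateStoreState, RMStateStoreEventType, RMStateStoreEvent>
  stateMachineFactory =
    new StateMachineFactory<RMStateStore, RMStateStoreState,
        RMStateStoreEventType, RMStateStoreEvent>(RMStateStoreState.ACTIVE)
      .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
          RMStateStoreEventType.STORE_APP, new StoreAppTransition())
      .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
          RMStateStoreEventType.UPDATE_APP, new UpdateAppTransition())
      .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
          RMStateStoreEventType.REMOVE_APP, new RemoveAppTransition());
{code}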
[jira] [Updated] (YARN-1918) Typo in description and error message for 'yarn.resourcemanager.cluster-id'
[ https://issues.apache.org/jira/browse/YARN-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anandha L Ranganathan updated YARN-1918: Attachment: YARN-1918.1.patch Typo in description and error message for 'yarn.resourcemanager.cluster-id' --- Key: YARN-1918 URL: https://issues.apache.org/jira/browse/YARN-1918 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Devaraj K Assignee: Anandha L Ranganathan Priority: Trivial Labels: newbie Attachments: YARN-1918.1.patch 1. In yarn-default.xml
{code:xml}
<property>
  <description>Name of the cluster. In a HA setting, this is used to ensure the RM participates in leader election fo this cluster and ensures it does not affect other clusters</description>
  <name>yarn.resourcemanager.cluster-id</name>
  <!--value>yarn-cluster</value-->
</property>
{code}
Here the line 'election fo this cluster and ensures it does not affect' should be replaced with 'election for this cluster and ensures it does not affect'. 2.
{code:xml}
org.apache.hadoop.HadoopIllegalArgumentException: Configuration doesn't specifyyarn.resourcemanager.cluster-id
    at org.apache.hadoop.yarn.conf.YarnConfiguration.getClusterId(YarnConfiguration.java:1336)
{code}
The above exception message is missing a space between the message text and the configuration name. -- This message was sent by Atlassian JIRA (v6.2#6252)
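For convenience, the corrected property exactly as the YARN-1918 report above asks for it (only the 'fo' to 'for' change in the description text):
{code:xml}
<property>
  <description>Name of the cluster. In a HA setting, this is used to ensure the RM participates in leader election for this cluster and ensures it does not affect other clusters</description>
  <name>yarn.resourcemanager.cluster-id</name>
  <!--value>yarn-cluster</value-->
</property>
{code}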
[jira] [Updated] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2012: - Attachment: YARN-2012-v2.txt Patch refreshed. Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
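To make the YARN-2012 proposal above concrete, a hypothetical allocation-file snippet; the attribute name "queue" is an assumption of this sketch, not a committed interface.
{code:xml}
<!-- Hypothetical sketch: apps matching no earlier rule land in root.adhoc if
     it exists, otherwise in root.default, keeping the rule terminal. The
     "queue" attribute name is an assumption, not the committed syntax. -->
<queuePlacementPolicy>
  <rule name="specified"/>
  <rule name="default" queue="root.adhoc"/>
</queuePlacementPolicy>
{code}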
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000435#comment-14000435 ] Hadoop QA commented on YARN-1354: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645314/YARN-1354-v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3757//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3757//console This message is automatically generated. Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch The set of active applications in the nodemanager context need to be recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline sever
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000471#comment-14000471 ] Hadoop QA commented on YARN-2049: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645260/YARN-2049.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3760//console This message is automatically generated. Delegation token stuff for the timeline sever - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999533#comment-13999533 ] Hadoop QA commented on YARN-2053: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645047/YARN-2053.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3751//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3751//console This message is automatically generated. Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
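Purely as an illustration of the class of fix for the NPE reported in YARN-2053 above (not the attached patch; the setter and helper names are assumptions of this sketch): make sure the response builder never sees a null token list.
{code}
// Hypothetical illustration, not the attached YARN-2053 patch; the setter and
// helper names are assumed. Never hand the protobuf builder a null list.
List<NMToken> transferredTokens = getNMTokensFromPreviousAttempts(appAttemptId);
if (transferredTokens == null) {
  transferredTokens = Collections.emptyList();
}
response.setNMTokensFromPreviousAttempts(transferredTokens);
{code}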
[jira] [Updated] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1339: - Attachment: YARN-1339v4.patch Updated patch now that YARN-1987 has been committed. Recover DeletionService state upon nodemanager restart -- Key: YARN-1339 URL: https://issues.apache.org/jira/browse/YARN-1339 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1339.patch, YARN-1339v2.patch, YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2069: -- Fix Version/s: (was: 2.1.0-beta) Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000284#comment-14000284 ] Bikas Saha commented on YARN-1366: -- bq. Seems like we are going with no resync api for now as per the current patch. I think it's a good idea to hold off on the new API unless we see a need. I feel there isn't a strong case for it yet. I don't think we can summarily make such a choice without a proper discussion. Again, I am not advocating either choice. But we should understand the approaches and their effects on the system (users + back-end implementation) before we make a call on the API. My last comment opened the discussion with some questions and it would be great if the assignee ([~rohithsharma]) and other committers/contributors expressed their understanding and insight on those questions. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM, then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
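A hypothetical AM-side sketch of the resync semantics described in YARN-1366 above (field and collection names are assumptions; whether this lives behind a new API or inside allocate() is exactly the open question in the comment):
{code}
// Hypothetical sketch of the resync semantics, not a committed API:
// reset the allocate sequence number and re-send the entire outstanding ask.
lastResponseId = 0;
List<ResourceRequest> askList = new ArrayList<ResourceRequest>(outstandingAsk);
List<ContainerId> releaseList = new ArrayList<ContainerId>(pendingRelease);
AllocateRequest request = AllocateRequest.newInstance(
    lastResponseId, progress, askList, releaseList, null);
AllocateResponse response = rmClient.allocate(request);
// After a resync the RM may report some container completions more than once,
// so the AM must tolerate duplicates here.
{code}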
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000448#comment-14000448 ] Hadoop QA commented on YARN-1936: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645288/YARN-1936.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3758//console This message is automatically generated. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998979#comment-13998979 ] Anubhav Dhoot commented on YARN-1365: - Hi [~ozawa], I just saw your comment after I had it ready. Can you please help review the tests I added? Thanks. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1976) Tracking url missing http protocol for FAILED application
[ https://issues.apache.org/jira/browse/YARN-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999004#comment-13999004 ] Hudson commented on YARN-1976: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1976. Fix CHANGES.txt for YARN-1976. (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Tracking url missing http protocol for FAILED application - Key: YARN-1976 URL: https://issues.apache.org/jira/browse/YARN-1976 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-1976-v2.patch, YARN-1976.patch Running yarn application -list -appStates FAILED does not print the http protocol name the way it does for FINISHED apps.
{noformat}
-bash-4.1$ yarn application -list -appStates FINISHED,FAILED,KILLED
14/04/15 23:55:07 INFO client.RMProxy: Connecting to ResourceManager at host
Total number of applications (application-types: [] and states: [FINISHED, FAILED, KILLED]):4
Application-Id                  Application-Name  Application-Type  User    Queue    State     Final-State  Progress  Tracking-URL
application_1397598467870_0004  Sleep job         MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0004
application_1397598467870_0003  Sleep job         MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0003
application_1397598467870_0002  Sleep job         MAPREDUCE         hrt_qa  default  FAILED    FAILED       100%      host:8088/cluster/app/application_1397598467870_0002
application_1397598467870_0001  word count        MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0001
{noformat}
It only prints 'host:8088/cluster/app/application_1397598467870_0002' instead of 'http://host:8088/cluster/app/application_1397598467870_0002' -- This message was sent by Atlassian JIRA (v6.2#6252)
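A minimal sketch of the kind of normalization that addresses the YARN-1976 report above (not the attached patch; the helper name is made up for illustration):
{code}
// Minimal illustrative helper, not the attached YARN-1976 patch: make sure a
// tracking URL carries a scheme before it is printed.
static String withScheme(String trackingUrl) {
  if (trackingUrl == null || trackingUrl.isEmpty() || trackingUrl.contains("://")) {
    return trackingUrl;
  }
  return "http://" + trackingUrl;
}
{code}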
[jira] [Assigned] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-1569: --- Assignee: zhihai xu For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Assignee: zhihai xu Priority: Minor Labels: newbie As per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't do this check so far (no bug there now) but should be improved as in FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
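For illustration, the pattern YARN-1569 asks for, following what FairScheduler already does; only one event case is shown, and it is a sketch rather than the eventual patch.
{code}
// Illustrative pattern only: verify the concrete event type before casting
// inside handle(SchedulerEvent), as FairScheduler already does.
case NODE_UPDATE:
  if (!(event instanceof NodeUpdateSchedulerEvent)) {
    throw new RuntimeException("Unexpected event type: " + event);
  }
  NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent) event;
  nodeUpdate(nodeUpdatedEvent.getRMNode());
  break;
{code}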
[jira] [Commented] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000393#comment-14000393 ] Hadoop QA commented on YARN-2066: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645233/YARN-2066.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3754//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3754//console This message is automatically generated. Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
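Reading the YARN-2066 description above directly, the corrected block would reference this.finish rather than start; see the attached patch for the authoritative change.
{code}
// Corrected form implied by the description: use the finish range, not start.
if (this.finish != null) {
  builder.setFinishBegin(finish.getMinimumLong());
  builder.setFinishEnd(finish.getMaximumLong());
}
{code}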
[jira] [Assigned] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-1569: --- Assignee: Anubhav Dhoot For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Assignee: Anubhav Dhoot Priority: Minor Labels: newbie As per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't do this check so far (no bug there now) but should be improved as in FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998822#comment-13998822 ] Wangda Tan commented on YARN-1368: -- LGTM, +1 (non-binding). Please kick off a Jenkins build. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
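As an illustration of the YARN-1368 description above (the constructor shape and accessor name are assumptions of this sketch, not the attached patch), the node-added scheduler event would carry the containers the NM reported at registration so the scheduler can rebuild its allocation state.
{code}
// Hypothetical sketch, not the attached patch: pass the containers the NM
// reported at registration to the scheduler with the NODE_ADDED event, so the
// scheduler can re-populate its allocation state for that node.
List<ContainerStatus> reportedContainers = rmNode.getContainersToRecover();
scheduler.handle(new NodeAddedSchedulerEvent(rmNode, reportedContainers));
{code}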
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000398#comment-14000398 ] Hadoop QA commented on YARN-1338: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645279/YARN-1338v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 15 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3753//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3753//console This message is automatically generated. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000528#comment-14000528 ] Karthik Kambatla commented on YARN-1474: Looks like it did run, but couldn't apply the patch. Mind updating? Also, I was wondering whether we should change the signature of {{reinitialize()}}. FWIW, I am +0 to changing it. # I understand that passing the RMContext is not required anymore, and it is better to change it so we don't accumulate more code calling it. # However, that is an incompatible change to ResourceScheduler, which is Private. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize method but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
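A rough sketch of the YARN-1474 idea above: folding a scheduler into the YARN service model so init/start/stop are explicit lifecycle steps. This is illustrative only and assumes the usual AbstractService hooks; it is not any of the attached patches.
{code}
// Illustrative sketch, not the attached patches: a scheduler as a YARN service.
public class FifoScheduler extends AbstractService implements ResourceScheduler {

  public FifoScheduler() {
    super(FifoScheduler.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // one-time configuration, previously done in reinitialize()
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // start any update threads here
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // stop threads and release resources here
    super.serviceStop();
  }
}
{code}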
[jira] [Assigned] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2065: - Assignee: Jian He AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Jian He Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000477#comment-14000477 ] Hadoop QA commented on YARN-1339: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645284/YARN-1339v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3759//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3759//console This message is automatically generated. Recover DeletionService state upon nodemanager restart -- Key: YARN-1339 URL: https://issues.apache.org/jira/browse/YARN-1339 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1339.patch, YARN-1339v2.patch, YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)