[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998337#comment-13998337 ] Sandy Ryza commented on YARN-2054: -- +1 > Poor defaults for YARN ZK configs for retries and retry-interval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch > > > Currently, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulative 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
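The arithmetic behind the description can be made explicit: the retry budget is simply the product of the two settings. A minimal sketch of that arithmetic (the class name is hypothetical, not YARN code):

```java
// Why the defaults add up to 1000 seconds: the RM retries up to
// zk-num-retries times, waiting zk-retry-interval-ms between attempts,
// so the total time before it gives up is roughly their product.
public class ZkRetryBudget {
    static long cumulativeMillis(int numRetries, long retryIntervalMs) {
        return numRetries * retryIntervalMs;
    }

    public static void main(String[] args) {
        // Defaults quoted in the issue: 500 retries, 2000 ms apart.
        System.out.println(cumulativeMillis(500, 2000L) / 1000 + " seconds");
    }
}
```

500 retries at a 2000 ms interval is a 1,000,000 ms budget, i.e. roughly 16 to 17 minutes of retrying before the RM gives up, which is the behavior the issue calls poor.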
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998324#comment-13998324 ] Wangda Tan commented on YARN-2053: -- Sure, I'll do that, thanks for review! > Slider AM fails to restart: NPE in > RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts > > > Key: YARN-2053 > URL: https://issues.apache.org/jira/browse/YARN-2053 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sumit Mohanty >Assignee: Wangda Tan > Attachments: YARN-2053.patch, > yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, > yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > > > Slider AppMaster restart fails with the following: > {code} > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
Karthik Kambatla created YARN-2062: -- Summary: Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover Key: YARN-2062 URL: https://issues.apache.org/jira/browse/YARN-2062 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla On busy clusters, we see several {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events invoked against NEW nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1975) Used resources shows escaped html in CapacityScheduler and FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998172#comment-13998172 ] Hudson commented on YARN-1975: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1779 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1779/]) YARN-1975. Fix yarn application CLI to print the scheme of the tracking url of failed/killed applications. Contributed by Junping Du (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1593874) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java > Used resources shows escaped html in CapacityScheduler and FairScheduler page > - > > Key: YARN-1975 > URL: https://issues.apache.org/jira/browse/YARN-1975 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0, 2.4.0 >Reporter: Nathan Roberts >Assignee: Mit Desai > Fix For: 3.0.0, 2.4.1 > > Attachments: YARN-1975.patch, screenshot-1975.png > > > Used resources displays as <memory:, vCores;> with capacity > scheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2061) Revisit logging levels in ZKRMStateStore
Karthik Kambatla created YARN-2061: -- Summary: Revisit logging levels in ZKRMStateStore Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997804#comment-13997804 ] Jian He commented on YARN-2053: --- Thanks for catching this! Patch looks good to me. Wangda, can you add a test case? Basically, allocating 2 containers on the same node in TestAMRestart#testNMTokensRebindOnAMRestart should be enough. > Slider AM fails to restart: NPE in > RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts > > > Key: YARN-2053 > URL: https://issues.apache.org/jira/browse/YARN-2053 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sumit Mohanty >Assignee: Wangda Tan > Attachments: YARN-2053.patch, > yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, > yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > > > Slider AppMaster restart fails with the following: > {code} > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
Nishkam Ravi created YARN-2041: -- Summary: Hard to co-locate MR2 and Spark jobs on the same cluster in YARN Key: YARN-2041 URL: https://issues.apache.org/jira/browse/YARN-2041 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Nishkam Ravi Fix For: 2.4.0, 2.3.0 Performance of MR2 jobs falls drastically as YARN config parameter yarn.nodemanager.resource.memory-mb is increased beyond a certain value. Performance of Spark falls drastically as the value of yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a large data set. This makes it hard to co-locate MR2 and Spark jobs in YARN. The experiments are being conducted on a 6-node cluster. The following workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, ShuffleText and PageRank. Will add more details to this JIRA over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997132#comment-13997132 ] Jian He commented on YARN-1368: --- New patch uploaded. > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1976) Tracking url missing http protocol for FAILED application
[ https://issues.apache.org/jira/browse/YARN-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994158#comment-13994158 ] Junping Du commented on YARN-1976: -- Hi [~jianhe], would you mind reviewing it again? Thx! > Tracking url missing http protocol for FAILED application > - > > Key: YARN-1976 > URL: https://issues.apache.org/jira/browse/YARN-1976 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Junping Du > Attachments: YARN-1976-v2.patch, YARN-1976.patch > > > Run yarn application -list -appStates FAILED; it does not print the http > protocol name the way it does for FINISHED apps. > {noformat} > -bash-4.1$ yarn application -list -appStates FINISHED,FAILED,KILLED > 14/04/15 23:55:07 INFO client.RMProxy: Connecting to ResourceManager at host > Total number of applications (application-types: [] and states: [FINISHED, > FAILED, KILLED]):4 > Application-IdApplication-Name > Application-Type User Queue State > Final-State ProgressTracking-URL > application_1397598467870_0004 Sleep job > MAPREDUCEhrt_qa defaultFINISHED > SUCCEEDED 100% > http://host:19888/jobhistory/job/job_1397598467870_0004 > application_1397598467870_0003 Sleep job > MAPREDUCEhrt_qa defaultFINISHED > SUCCEEDED 100% > http://host:19888/jobhistory/job/job_1397598467870_0003 > application_1397598467870_0002 Sleep job > MAPREDUCEhrt_qa default FAILED >FAILED 100% > host:8088/cluster/app/application_1397598467870_0002 > application_1397598467870_0001 word count > MAPREDUCEhrt_qa defaultFINISHED > SUCCEEDED 100% > http://host:19888/jobhistory/job/job_1397598467870_0001 > {noformat} > It only prints 'host:8088/cluster/app/application_1397598467870_0002' instead of > 'http://host:8088/cluster/app/application_1397598467870_0002' -- This message was sent by Atlassian JIRA (v6.2#6252)
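The gist of the fix the issue asks for is URL normalization: if a failed app's tracking URL lacks a scheme, prepend one before printing. A minimal sketch (a hypothetical helper, not the actual YARN patch):

```java
// Normalize a tracking URL so FAILED apps print a scheme the same way
// FINISHED apps do. Hypothetical helper for illustration only.
public class TrackingUrl {
    static String withScheme(String url) {
        if (url == null || url.isEmpty()) {
            return url;
        }
        // Leave URLs that already carry a scheme (http://, https://) alone.
        return url.contains("://") ? url : "http://" + url;
    }
}
```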
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992711#comment-13992711 ] Hadoop QA commented on YARN-2022: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643727/Yarn-2022.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3718//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3718//console This message is automatically generated. 
> Preempting an Application Master container can be kept as least priority when > multiple applications are marked for preemption by > ProportionalCapacityPreemptionPolicy > - > > Key: YARN-2022 > URL: https://issues.apache.org/jira/browse/YARN-2022 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: Yarn-2022.1.patch > > > Cluster Size = 16GB [2NM's] > Queue A Capacity = 50% > Queue B Capacity = 50% > Consider there are 3 applications running in Queue A which has taken the full > cluster capacity. > J1 = 2GB AM + 1GB * 4 Maps > J2 = 2GB AM + 1GB * 4 Maps > J3 = 2GB AM + 1GB * 2 Maps > Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. > Currently in this scenario, Jobs J3 will get killed including its AM. > It is better if AM can be given least priority among multiple applications. > In this same scenario, map tasks from J3 and J2 can be preempted. > Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
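The improvement described above, treating AM containers as the last candidates for preemption, can be sketched as a simple ordering rule (illustrative types, not the real ProportionalCapacityPreemptionPolicy code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// When choosing containers to preempt, consider AM containers only after
// all non-AM containers, so J2's and J3's map tasks go before J3's AM.
public class PreemptionOrder {
    static class ContainerInfo {
        final String id;
        final boolean isAm;

        ContainerInfo(String id, boolean isAm) {
            this.id = id;
            this.isAm = isAm;
        }
    }

    static List<ContainerInfo> victimOrder(List<ContainerInfo> candidates) {
        List<ContainerInfo> ordered = new ArrayList<>(candidates);
        // false sorts before true, so AM containers land at the end.
        ordered.sort(Comparator.comparing((ContainerInfo c) -> c.isAm));
        return ordered;
    }
}
```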
[jira] [Resolved] (YARN-2057) NPE in RM handling node update while app submission in progress
[ https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved YARN-2057. -- Resolution: Duplicate > NPE in RM handling node update while app submission in progress > --- > > Key: YARN-2057 > URL: https://issues.apache.org/jira/browse/YARN-2057 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test > TestDestroyMasterlessAM >Reporter: Steve Loughran > > One of our test runs finished prematurely with an NPE in the RM, followed by > the RM thread calling system.exit(). It looks like an NM update came in while > the app was still being set up, causing confusion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2057) NPE in RM handling node update while app submission in progress
[ https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997762#comment-13997762 ] Steve Loughran commented on YARN-2057: -- you're right, the stack trace and the "heartbeat during app launch" scenario seem to match; closing as duplicate > NPE in RM handling node update while app submission in progress > --- > > Key: YARN-2057 > URL: https://issues.apache.org/jira/browse/YARN-2057 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test > TestDestroyMasterlessAM >Reporter: Steve Loughran > > One of our test runs finished prematurely with an NPE in the RM, followed by > the RM thread calling system.exit(). It looks like an NM update came in while > the app was still being set up, causing confusion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2055) Preemption: Jobs are failing because AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2055: Description: If Queue A does not have enough capacity to run an AM, the AM will borrow capacity from Queue B. In that case the AM will be killed when Queue B reclaims its capacity, then launched again and killed again, and the job will fail. (was: Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs.) > Preemption: Jobs are failing because AMs are getting launched and killed > multiple times > - > > Key: YARN-2055 > URL: https://issues.apache.org/jira/browse/YARN-2055 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal > Fix For: 2.1.0-beta > > > If Queue A does not have enough capacity to run an AM, the AM will borrow > capacity from Queue B. In that case the AM will be killed when Queue B > reclaims its capacity, then launched again and killed again, and the job > will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997386#comment-13997386 ] Wangda Tan commented on YARN-1368: -- Hi Jian, Thanks for updating your patch. I took a look at it and have some comments. 1) RMAppImpl.java {code} + // Add the current attempt to the scheduler. It'll be removed from + // scheduler in RMAppAttempt#BaseFinalTransition + app.handler.handle(new AppAttemptAddedSchedulerEvent(app.currentAttempt +.getAppAttemptId(), false)); {code} I don't quite understand this: in the current trunk code, how does RMAppAttempt notify the scheduler with an AppAttemptAddedSchedulerEvent when recovering? If this is a missing piece in current trunk, I would rather put the code that sends AppAttemptAddedSchedulerEvent into RMAppAttemptImpl. 2) RMAppAttemptImpl.java {code} + new ContainerFinishedTransition( + new AMContainerCrashedAtRunningTransition(), + RMAppAttemptState.RUNNING)) {code} And {code} + new ContainerFinishedTransition( + new AMContainerCrashedBeforeRunningTransition(), + RMAppAttemptState.LAUNCHED)) {code} I see that the RUNNING and LAUNCHED states are passed in as targetedFinalState, and that targetFinalState is only used in FinalStateSavedTransition. I'm confused; could you please elaborate? Why split AMContainerCrashedTransition into two transitions and set their states to RUNNING and LAUNCHED differently? 3) In AbstractYarnScheduler.java 3.1 {code} + if (rmApp == null) { +LOG.error("Skip recovering container " + status ++ " for unknown application."); +continue; + } {code} And {code} + if (rmApp.getApplicationSubmissionContext().getUnmanagedAM()) { +if (LOG.isDebugEnabled()) { + LOG.debug("Skip recovering container " + status + + " for unmanaged AM." + rmApp.getApplicationId()); +} +continue; + } {code} And {code} + SchedulerApplication schedulerApp = applications.get(appId); + if (schedulerApp == null) { +LOG.info("Skip recovering container " + status ++ " for unknown SchedulerApplication. Application state is " ++ rmApp.getState()); +continue; + } {code} It would be better to make the log levels consistent across these cases. 3.2 {code} + public RMContainer createContainer(ContainerStatus status, RMNode node) { +Container container = +Container.newInstance(status.getContainerId(), node.getNodeID(), + node.getHttpAddress(), Resource.newInstance(1024, 1), + Priority.newInstance(0), null); {code} Should we change Resource.newInstance(1024, 1) to the container's actual resource? 3.3 For recoverContainersOnNode, is it possible for NODE_ADDED to happen before APP_ADDED? If so, a container may be recovered before its application is added to the scheduler. 4) In FiCaSchedulerNode.java: {code} + @Override + public void recoverContainer(RMContainer rmContainer) { +if (rmContainer.getState().equals(RMContainerState.COMPLETED)) { + return; +} +allocateContainer(null, rmContainer); + } {code} Since allocateContainer doesn't use the application-id parameter, I think it's better to remove it. 5) In TestWorkPreservingRMRestart.java {code} +assertEquals(usedCapacity, queue.getAbsoluteUsedCapacity(), 0); {code} It may be better to use the two-parameter assertEquals, because the delta is 0. > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration.
The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994162#comment-13994162 ] Hadoop QA commented on YARN-2016: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644248/YARN-2016.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3730//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3730//console This message is automatically generated. 
> Yarn getApplicationRequest start time range is not honored > -- > > Key: YARN-2016 > URL: https://issues.apache.org/jira/browse/YARN-2016 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Venkat Ranganathan >Assignee: Junping Du > Attachments: YARN-2016.patch, YarnTest.java > > > When we query for the previous applications by creating an instance of > GetApplicationsRequest and setting the start time range and application tag, > we see that the start range provided is not honored and all applications with > the tag are returned > Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
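The broken behavior reduces to a range check that should be applied server-side: an application matching the tag should only be returned when its start time falls inside the requested window (the client sets the window via GetApplicationsRequest, if memory serves through a setStartRange-style setter). A hedged sketch of just that predicate, with illustrative names rather than the actual RM code:

```java
// An application should be returned only when its start time falls inside
// the requested [begin, end] range; the bug is that this check was skipped,
// so all tag-matching applications came back regardless of start time.
public class StartRangeFilter {
    static boolean inStartRange(long appStartTime, long begin, long end) {
        return appStartTime >= begin && appStartTime <= end;
    }
}
```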
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994014#comment-13994014 ] Karthik Kambatla commented on YARN-1861: I am obviously a +1 because I wrote the patch. Can someone other than Xuan and me take a look? > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong >Priority: Blocker > Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, > YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch > > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996780#comment-13996780 ] Karthik Kambatla commented on YARN-2016: Sorry for missing those merge-backs. A simple unit test like the one here wouldn't have let the mistake happen. > Yarn getApplicationRequest start time range is not honored > -- > > Key: YARN-2016 > URL: https://issues.apache.org/jira/browse/YARN-2016 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Venkat Ranganathan >Assignee: Junping Du > Fix For: 2.4.1 > > Attachments: YARN-2016.patch, YarnTest.java > > > When we query for the previous applications by creating an instance of > GetApplicationsRequest and setting the start time range and application tag, > we see that the start range provided is not honored and all applications with > the tag are returned > Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
[ https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996494#comment-13996494 ] Jason Lowe commented on YARN-2014: -- I did a bit of investigation on this, and the problem appears to be around the duration of the tasks. In 2.4 the sleep job tasks are taking about 660 msec longer to execute than they do in 0.23. I didn't nail down exactly where this extra delay was coming from, but I did notice that the tasks in 2.4 are loading over 800 more classes than they do in 0.23. I think most of these are coming from the service loader for FileSystem schemas, as the 2.4 tasks load every available FileSystem and 0.23 does not. In 0.23 FileSystem schemas are declared in configs, but in 2.4 they are dynamically detected and loaded via a service loader. The ~0.5s delay in the task appears to be a fixed startup cost and is amplified by the AM scalability test since it runs very short tasks (the main portion of the map task lasts 1 second) and multiple tasks are run per map "slot" on the cluster, serializing the task startup delays. > Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9 > > > Key: YARN-2014 > URL: https://issues.apache.org/jira/browse/YARN-2014 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: patrick white > > Performance comparison benchmarks from 2.x against 0.23 show AM scalability > benchmark's runtime is approximately 10% slower in 2.4.0. The trend is > consistent across later releases in both lines, latest release numbers are: > 2.4.0.0 runtime 255.6 seconds (avg 5 passes) > 0.23.9.12 runtime 230.4 seconds (avg 5 passes) > Diff: -9.9% > AM Scalability test is essentially a sleep job that measures time to launch > and complete a large number of mappers. > The diff is consistent and has been reproduced in both a larger (350 node, > 100,000 mappers) perf environment, as well as a small (10 node, 2,900 > mappers) demo cluster.
-- This message was sent by Atlassian JIRA (v6.2#6252)
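The mechanism Jason describes, service-loader discovery forcing every provider class to load, can be illustrated with a toy interface (this is java.util.ServiceLoader in general, not Hadoop's actual FileSystem loading code):

```java
import java.util.ServiceLoader;

// java.util.ServiceLoader reads META-INF/services entries from the
// classpath and loads every registered provider class on iteration,
// which is where a fixed per-JVM startup cost can come from when many
// providers are registered.
public class SchemeDiscovery {
    interface Scheme {
        String name();
    }

    static int countProviders() {
        int loaded = 0;
        // Each iteration loads (and instantiates) one provider class.
        for (Scheme s : ServiceLoader.load(Scheme.class)) {
            loaded++;
        }
        return loaded;
    }
}
```

In this self-contained sketch no providers are registered, so the count is zero; on a 2.x task classpath the analogous iteration over FileSystem providers touches many classes, consistent with the 800-extra-classes observation.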
[jira] [Updated] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-2050: -- Attachment: YARN-2050-2.patch Thanks, Jason. You are right. remoteAppLogDir could point to a different type of file system. Here is the updated patch. > Fix LogCLIHelpers to create the correct FileContext > --- > > Key: YARN-2050 > URL: https://issues.apache.org/jira/browse/YARN-2050 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: YARN-2050-2.patch, YARN-2050.patch > > > LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus > the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
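The shape of the bug can be sketched with hypothetical names (this is not the LogCLIHelpers code itself): the helper must be resolved against the filesystem that owns the remote log directory, not the process-wide default.

```java
import java.net.URI;

// Buggy behavior: ignore the remote log dir and always use the default
// filesystem. Fixed behavior: derive the scheme from the remote path,
// falling back to the default only when the path has no scheme of its own.
public class LogContext {
    static String contextFor(URI remoteLogDir, String defaultScheme) {
        String scheme = remoteLogDir.getScheme();
        return scheme != null ? scheme : defaultScheme;
    }
}
```

In Hadoop itself the analogous fix would presumably build the context from the log directory's URI, e.g. FileContext.getFileContext(remoteAppLogDir.toUri(), conf) rather than the no-argument FileContext.getFileContext().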
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997621#comment-13997621 ] Wangda Tan commented on YARN-2053: -- Took a look at the related code; I think this problem is caused by the following. In ApplicationMasterService.registerApplicationMaster(), nmTokens for the previous attempt's containers are added via a loop: {code} List<Container> transferredContainers = ((AbstractYarnScheduler) rScheduler) .getTransferredContainers(applicationAttemptId); if (!transferredContainers.isEmpty()) { response.setContainersFromPreviousAttempts(transferredContainers); List<NMToken> nmTokens = new ArrayList<NMToken>(); for (Container container : transferredContainers) { try { nmTokens.add(rmContext.getNMTokenSecretManager() .createAndGetNMToken(app.getUser(), applicationAttemptId, container)); } {code} And NMTokenSecretManager.createAndGetNMToken() does: {code} NMToken nmToken = null; if (nodeSet != null) { if (!nodeSet.contains(container.getNodeId())) { ... // set nmToken ... } } return nmToken {code} So if multiple containers come from the same NM (with the same NodeId), a null NMToken will be added to the NMToken list. Then in RegisterApplicationMasterResponsePBImpl.getTokenProtoIterable, it tries to convert the null NMToken to a proto: {code} @Override public NMTokenProto next() { return convertToProtoFormat(iter.next()); } {code} I think this is the root cause of the problem; uploaded a patch.
> Slider AM fails to restart: NPE in > RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts > > > Key: YARN-2053 > URL: https://issues.apache.org/jira/browse/YARN-2053 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sumit Mohanty > Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, > yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > > > Slider AppMaster restart fails with the following: > {code} > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
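The root-cause analysis above reduces to a simple invariant: one token per node, and no nulls in the response list. A toy model of the failing loop and the missing null check (illustrative names, not the actual YARN classes):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Models the described behavior: the token factory yields one token per
// node and null for nodes it has already issued a token for, so the
// caller must skip nulls instead of adding them to the response list.
public class NmTokenList {
    static List<String> tokensFor(List<String> containerNodes) {
        Set<String> seen = new HashSet<>();
        List<String> tokens = new ArrayList<>();
        for (String node : containerNodes) {
            // One token per node, null on repeat (like createAndGetNMToken).
            String token = seen.add(node) ? "token-" + node : null;
            if (token != null) {        // the missing null check
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Without the null check, two containers on the same NM put a null in the list, and serializing it to protobuf throws the NPE in the stack trace.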
[jira] [Updated] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup
[ https://issues.apache.org/jira/browse/YARN-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2036: - Attachment: YARN2036-02.patch First revision based on comments. > Document yarn.resourcemanager.hostname in ClusterSetup > -- > > Key: YARN-2036 > URL: https://issues.apache.org/jira/browse/YARN-2036 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Ray Chiang >Priority: Minor > Fix For: 2.5.0 > > Attachments: YARN2036-01.patch, YARN2036-02.patch > > > ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people > should just be able to use that directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
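A yarn-site.xml fragment of the kind such documentation would presumably show (the hostname value is a placeholder): setting this single property lets the individual RM service addresses take their standard-port defaults on that host instead of being configured one by one.

```xml
<!-- Illustrative only: with the hostname set once, properties such as
     yarn.resourcemanager.address default to
     ${yarn.resourcemanager.hostname} plus their standard ports. -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm.example.com</value>
</property>
```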
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997187#comment-13997187 ] Venkat Ranganathan commented on YARN-2016: -- [~djp] It would be good to have a unit test, as I mentioned before. The test case I uploaded was specific to one issue, but tests covering both directions of the wire transfers and the like would also be good. Maybe that is something I will consider adding > Yarn getApplicationRequest start time range is not honored > -- > > Key: YARN-2016 > URL: https://issues.apache.org/jira/browse/YARN-2016 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Venkat Ranganathan >Assignee: Junping Du > Fix For: 2.4.1 > > Attachments: YARN-2016.patch, YarnTest.java > > > When we query for the previous applications by creating an instance of > GetApplicationsRequest and setting the start time range and application tag, > we see that the start range provided is not honored and all applications with > the tag are returned > Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996610#comment-13996610 ] Zhijie Shen commented on YARN-667: -- bq. do you have a sense of how likely it could happen in 2.X, as you are currently working on ATS? My understanding is that we may have two possible changes in the future: 1. One is the data itself. Say the application gains one more state in the future: the new RM will try to persist it into the history store, while the old history server or client may not understand it. This should be a common problem with RMStateStore. This is driven by the RM itself. 2. The other is a change to the timeline server internals. Say in the near future we modify the file structure in FileSystemApplicationHistoryStore to improve performance; the new FileSystemApplicationHistoryStore may no longer understand the existing data structure written by the old one. However, I think this part should be taken care of by the timeline server itself. > Data persisted in RM should be versioned > > > Key: YARN-667 > URL: https://issues.apache.org/jira/browse/YARN-667 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.0.4-alpha >Reporter: Siddharth Seth >Assignee: Junping Du > > Includes data persisted for RM restart, NodeManager directory structure and > the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997792#comment-13997792 ] Tsuyoshi OZAWA commented on YARN-1474: -- Could someone help start the Jenkins job? It doesn't seem to be working. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996577#comment-13996577 ] Zhijie Shen commented on YARN-2048: --- bq. the only implementation of ApplicationContext is ApplicationHistoryManagerImpl, which retrieves container information from the history store. That is a historical artifact of the history service's development. We plan to have a common interface for both the RM and the history server to retrieve the app information. bq. How do you fetch the containers info from a historyserver and display it on the RM web? We intend to provide consistent RM and history web UIs, but the RM web UI shows the running and cached completed apps, while the history web UI shows the completed apps. bq. If the information is from the history store, it seems the RM won't get that kind of info until the application is done? Sometimes the user's application might be a long-lived application that never finishes unless the user kills it. Not exactly. The RM web UI will always show the running app information. Currently, the history web UI will only show the app information once the app is completed, due to the current history store implementation. After we rebase the store on top of the timeline store, we should be rid of the issue. bq. Seems the only way providing containers info to RM is to maintain a list in RMAppAttempImpl, which was my way as well. Months ago, I defined ApplicationContext to be a common interface for retrieving the app information, including containers. Recently, as released in 2.4.0, we added analogous RPC interfaces in both the RM and the history server to retrieve app information, again including getContainer(s). This is the reason why, in YARN-1809, I'd like to rebase both the RM and history web UIs to retrieve the app information from those RPC interfaces. In this way, both the RM and history web UIs show consistent app information via both the CLI and web pages. Of course, we'd like to make the REST APIs uniform as well. 
> List all of the containers of an application from the yarn web > -- > > Key: YARN-2048 > URL: https://issues.apache.org/jira/browse/YARN-2048 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, webapp >Affects Versions: 2.3.0, 2.4.0, 2.5.0 >Reporter: Min Zhou > Attachments: YARN-2048-trunk-v1.patch > > > Currently, YARN doesn't provide a way to list all of the containers of an > application from its web UI. This kind of information is needed by the > application user. They can conveniently know how many containers their > applications have already acquired, as well as which nodes those containers were > launched on. They also want to view the logs of each container of an > application. > One approach is to maintain a container list in RMAppImpl and expose this info > on the Application page. I will submit a patch soon. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997722#comment-13997722 ] Bikas Saha commented on YARN-1366: -- Is there any value in combining the re-register and re-sending of pending requests in 1 new "resync" method? I am not arguing in favor of it, but it would help if we evaluate the pros/cons and go through the mental exercise of how things would work on the AM and RM sides. This is important because we are making API changes, and these are hard to undo. e.g. a pro of a new resync method - the API clearly specifies that pending requests must be re-submitted. Are there any other advantages on the RM side to having this information come together in 1 "atomic" operation? Does it help the RM to differentiate between an AM that was launched and had registered vs an AM that had been launched but where the RM died before the AM could register? Is that important in any case? > ApplicationMasterService should Resync with the AM upon allocate call after > restart > --- > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, > YARN-1366.prototype.patch, YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0, and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
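The resync contract described above can be modeled in a few lines of plain Java. This is an illustrative sketch only: the class and method names (`AmAllocator`, `onResync`) are invented here and are not the actual AMRMClient code; the point is the two obligations the comment discusses, resetting the sequence number to 0 and resending the entire outstanding ask.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the AM-side resync contract: on resync, reset the
// allocate RPC sequence number to 0 and resend every outstanding request.
class AmAllocator {
    int responseId = 7;                          // allocate RPC sequence number
    List<String> outstanding = new ArrayList<>(); // all not-yet-satisfied asks
    List<String> lastSent = new ArrayList<>();    // what the last heartbeat carried

    void allocate() {                // normal heartbeat: sends only deltas
        lastSent = new ArrayList<>();
        responseId++;
    }

    void onResync() {                // RM restarted: full resync
        responseId = 0;                           // reset the sequence number
        lastSent = new ArrayList<>(outstanding);  // resend the entire outstanding ask
    }
}

public class ResyncSketch {
    public static void main(String[] args) {
        AmAllocator am = new AmAllocator();
        am.outstanding.add("host1:1024mb");
        am.outstanding.add("*:2048mb");
        am.allocate();               // ordinary heartbeat carries nothing pending
        am.onResync();               // resync carries everything again
        System.out.println("responseId=" + am.responseId
                + ", resent=" + am.lastSent.size());
    }
}
```

A combined "resync" API, as weighed in the comment, would make the second obligation explicit in the method contract rather than leaving it as a convention the AM must remember.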
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992698#comment-13992698 ] Hadoop QA commented on YARN-1474: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643920/YARN-1474.10.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3717//console This message is automatically generated. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, > YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, > YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997708#comment-13997708 ] Chris Riccomini commented on YARN-2027: --- relaxLocality was set to false. {noformat} (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, getHosts, List("/default-rack").toArray[String], priority, false))) {noformat} The last false in that parameter list is relaxLocality. > YARN ignores host-specific resource requests > > > Key: YARN-2027 > URL: https://issues.apache.org/jira/browse/YARN-2027 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.4.0 > Environment: RHEL 6.1 > YARN 2.4 >Reporter: Chris Riccomini > > YARN appears to be ignoring host-level ContainerRequests. > I am creating a container request with code that pretty closely mirrors the > DistributedShell code: > {code} > protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) > { > info("Requesting %d container(s) with %dmb of memory" format (containers, > memMb)) > val capability = Records.newRecord(classOf[Resource]) > val priority = Records.newRecord(classOf[Priority]) > priority.setPriority(0) > capability.setMemory(memMb) > capability.setVirtualCores(cpuCores) > // Specifying a host in the String[] host parameter here seems to do > nothing. Setting relaxLocality to false also doesn't help. > (0 until containers).foreach(idx => amClient.addContainerRequest(new > ContainerRequest(capability, null, null, priority))) > } > {code} > When I run this code with a specific host in the ContainerRequest, YARN does > not honor the request. Instead, it puts the container on an arbitrary host. > This appears to be true for both the FifoScheduler and the CapacityScheduler. 
> Currently, we are running the CapacityScheduler with the following settings: > {noformat} > > > yarn.scheduler.capacity.maximum-applications > 1 > > Maximum number of applications that can be pending and running. > > > > yarn.scheduler.capacity.maximum-am-resource-percent > 0.1 > > Maximum percent of resources in the cluster which can be used to run > application masters i.e. controls number of concurrent running > applications. > > > > yarn.scheduler.capacity.resource-calculator > > org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator > > The ResourceCalculator implementation to be used to compare > Resources in the scheduler. > The default i.e. DefaultResourceCalculator only uses Memory while > DominantResourceCalculator uses dominant-resource to compare > multi-dimensional resources such as Memory, CPU etc. > > > > yarn.scheduler.capacity.root.queues > default > > The queues at the this level (root is the root queue). > > > > yarn.scheduler.capacity.root.default.capacity > 100 > Samza queue target capacity. > > > yarn.scheduler.capacity.root.default.user-limit-factor > 1 > > Default queue user limit a percentage from 0.0 to 1.0. > > > > yarn.scheduler.capacity.root.default.maximum-capacity > 100 > > The maximum capacity of the default queue. > > > > yarn.scheduler.capacity.root.default.state > RUNNING > > The state of the default queue. State can be one of RUNNING or STOPPED. > > > > yarn.scheduler.capacity.root.default.acl_submit_applications > * > > The ACL of who can submit jobs to the default queue. > > > > yarn.scheduler.capacity.root.default.acl_administer_queue > * > > The ACL of who can administer jobs on the default queue. > > > > yarn.scheduler.capacity.node-locality-delay > 40 > > Number of missed scheduling opportunities after which the > CapacityScheduler > attempts to schedule rack-local containers. 
> Typically this should be set to number of nodes in the cluster, By > default is setting > approximately number of nodes in one rack which is 40. > > > > {noformat} > Digging into the code a bit (props to [~jghoman] for finding this), we have a > theory as to why this is happening. It looks like > RMContainerRequestor.addContainerReq adds three resource requests per > container request: data-local, rack-local, and any: > {code} > protected void addContainerReq(ContainerRequest req) { > // Create resource requests > for (String host : req.hosts) { > // Data-local > if (!isNodeBlacklisted(host)) { > addResourceRequest(req.priority, host, req.capability); > }
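The three-level expansion in that theory can be sketched with a self-contained model. This is a simplified illustration of the behavior described for `RMContainerRequestor.addContainerReq` (node-local, rack-local, and ANY requests per container request), not the actual YARN code; the helper names and the rack map are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model: one host-specific container request is expanded into
// three resource requests. If the scheduler honors the ANY ("*") entry
// without checking relaxLocality, the container can land on any node,
// which matches the behavior reported in this issue.
public class LocalityExpansionSketch {
    static List<String> expand(String host, Map<String, String> hostToRack) {
        List<String> asks = new ArrayList<>();
        asks.add(host);                                            // data-local
        asks.add(hostToRack.getOrDefault(host, "/default-rack"));  // rack-local
        asks.add("*");                                             // off-switch (ANY)
        return asks;
    }

    public static void main(String[] args) {
        Map<String, String> racks = new HashMap<>();
        racks.put("node-1", "/rack-a");
        List<String> asks = expand("node-1", racks);
        System.out.println(asks); // the host ask plus its rack and ANY fallbacks
    }
}
```

In the real API, `AMRMClient.ContainerRequest(capability, nodes, racks, priority, relaxLocality)` with `relaxLocality=false` is supposed to suppress the fallback levels; the bug report is that the placement behaves as if the ANY-level request still wins.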
[jira] [Commented] (YARN-2011) Typo in TestLeafQueue
[ https://issues.apache.org/jira/browse/YARN-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993395#comment-13993395 ] Junping Du commented on YARN-2011: -- Nice catch, [~airbots]! There is also a warning that the following code in testAppAttemptMetrics() is never used: {code} FiCaSchedulerApp app_0 = new FiCaSchedulerApp(appAttemptId_0, user_0, a, null, rmContext); {code} Given you are already there, would you like to fix this warning in your patch as well? I will review and commit it. Thanks! > Typo in TestLeafQueue > - > > Key: YARN-2011 > URL: https://issues.apache.org/jira/browse/YARN-2011 > Project: Hadoop YARN > Issue Type: Test >Affects Versions: 2.4.0 >Reporter: Chen He >Assignee: Chen He >Priority: Trivial > Attachments: YARN-2011.patch > > > a.assignContainers(clusterResource, node_0); > assertEquals(2*GB, a.getUsedResources().getMemory()); > assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); > assertEquals(0*GB, app_1.getCurrentConsumption().getMemory()); > assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G > assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G > // Again one to user_0 since he hasn't exceeded user limit yet > a.assignContainers(clusterResource, node_0); > assertEquals(3*GB, a.getUsedResources().getMemory()); > assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); > assertEquals(1*GB, app_1.getCurrentConsumption().getMemory()); > assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G > assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2057) NPE in RM handling node update while app submission in progress
[ https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997635#comment-13997635 ] Wangda Tan commented on YARN-2057: -- [~ste...@apache.org], I think this is the same issue, which was already resolved by YARN-1986. Please take a look and close it if you agree. > NPE in RM handling node update while app submission in progress > --- > > Key: YARN-2057 > URL: https://issues.apache.org/jira/browse/YARN-2057 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test > TestDestroyMasterlessAM >Reporter: Steve Loughran > > One of our test runs finished prematurely with an NPE in the RM, followed by > the RM thread calling system.exit(). It looks like an NM update came in while > the app was still being set up, causing confusion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995745#comment-13995745 ] Hadoop QA commented on YARN-1938: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644488/YARN-1938.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3735//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3735//console This message is automatically generated. 
> Kerberos authentication for the timeline server > --- > > Key: YARN-1938 > URL: https://issues.apache.org/jira/browse/YARN-1938 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1938.1.patch, YARN-1938.2.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
Karthik Kambatla created YARN-2054: -- Summary: Poor defaults for YARN ZK configs for retries and retry-inteval Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Currently, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulative 1000 seconds before the RM gives up trying to connect to ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
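The cumulative timeout implied by those two defaults is a quick multiplication; the sketch below just spells out the arithmetic (the class name is invented for illustration):

```java
// Quick arithmetic check of the YARN-2054 defaults: 500 retries at a
// 2000 ms interval gives a 1,000,000 ms (1000 s, ~16.7 min) retry budget
// before the RM gives up on ZooKeeper.
public class ZkRetryBudget {
    public static void main(String[] args) {
        int numRetries = 500;        // yarn.resourcemanager.zk-num-retries
        int retryIntervalMs = 2000;  // yarn.resourcemanager.zk-retry-interval-ms
        long totalMs = (long) numRetries * retryIntervalMs;
        System.out.println(totalMs / 1000 + " seconds"); // 1000 seconds
    }
}
```

Sixteen-plus minutes is far longer than a typical ZK session timeout, which is why the defaults are considered poor.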
[jira] [Commented] (YARN-2057) NPE in RM handling node update while app submission in progress
[ https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997557#comment-13997557 ] Steve Loughran commented on YARN-2057: -- Stack trace. This is a transient failure -and didn't reoccur immediately after- so the log is all there is, I'm afraid Can I note that failing to handle status updates by triggering RM exit is a bit brittle. It exposes the RM to failover if a single NM starts sending in bad data. {code} 2014-05-14 14:48:32,248 [JUnit] DEBUG launch.AbstractLauncher (AbstractLauncher.java:completeContainerLaunch(162)) - Completed setting up container command $JAVA_HOME/bin/java -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true -Xmx256M -ea -esa org.apache.slider.server.appmaster.SliderAppMaster create test_destroy_masterless_am --debug -cluster-uri file:/Users/stevel/.slider/cluster/test_destroy_masterless_am --rm 192.168.1.86:54470 --fs file:/// -D slider.registry.path=/registry -D slider.zookeeper.quorum=localhost:1 1>/slider-out.txt 2>/slider-err.txt 2014-05-14 14:48:32,249 [JUnit] INFO launch.AppMasterLauncher (AppMasterLauncher.java:submitApplication(207)) - Submitting application to Resource Manager 2014-05-14 14:48:32,281 [IPC Server handler 2 on 54471] INFO resourcemanager.ClientRMService (ClientRMService.java:submitApplication(537)) - Application with id 1 submitted by user stevel 2014-05-14 14:48:32,281 [AsyncDispatcher event handler] INFO rmapp.RMAppImpl (RMAppImpl.java:transition(863)) - Storing application with id application_1400075308869_0001 2014-05-14 14:48:32,282 [IPC Server handler 2 on 54471] INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(142)) - USER=stevel IP=192.168.1.86 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1400075308869_0001 2014-05-14 14:48:32,283 [AsyncDispatcher event handler] INFO rmapp.RMAppImpl (RMAppImpl.java:handle(639)) - application_1400075308869_0001 State change 
from NEW to NEW_SAVING 2014-05-14 14:48:32,287 [AsyncDispatcher event handler] INFO recovery.RMStateStore (RMStateStore.java:handleStoreEvent(620)) - Storing info for app: application_1400075308869_0001 2014-05-14 14:48:32,295 [AsyncDispatcher event handler] INFO rmapp.RMAppImpl (RMAppImpl.java:handle(639)) - application_1400075308869_0001 State change from NEW_SAVING to SUBMITTED 2014-05-14 14:48:32,296 [ResourceManager Event Processor] INFO fifo.FifoScheduler (FifoScheduler.java:addApplication(369)) - Accepted application application_1400075308869_0001 from user: stevel, currently num of applications: 1 2014-05-14 14:48:32,297 [ResourceManager Event Processor] FATAL resourcemanager.ResourceManager (ResourceManager.java:run(600)) - Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:745) 2014-05-14 14:48:32,298 [ResourceManager Event Processor] INFO resourcemanager.ResourceManager (ResourceManager.java:run(604)) - Exiting, bbye.. 
2014-05-14 14:48:32,298 [AsyncDispatcher event handler] INFO rmapp.RMAppImpl (RMAppImpl.java:handle(639)) - application_1400075308869_0001 State change from SUBMITTED to ACCEPTED 2014-05-14 14:48:32,299 [AsyncDispatcher event handler] INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(608)) - Registering app attempt : appattempt_1400075308869_0001_01 {code} > NPE in RM handling node update while app submission in progress > --- > > Key: YARN-2057 > URL: https://issues.apache.org/jira/browse/YARN-2057 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test > TestDestroyMasterlessAM >Reporter: Steve Loughran > > One of our test runs finished prematurely with an NPE in the RM, followed by > the RM thread calling system.exit(). It looks like an NM update came in while > the app was still being set up, causing confusion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997114#comment-13997114 ] Junping Du commented on YARN-2016: -- bq. Sorry for missing those merge-backs. A simple unit test like here wouldn't have let the mistake happen. No worries. We all make mistakes. :) I am proposing to add a simple unit test like this for any PBImpl changes in the future. [~kasha] and all in the watch list, thoughts? > Yarn getApplicationRequest start time range is not honored > -- > > Key: YARN-2016 > URL: https://issues.apache.org/jira/browse/YARN-2016 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Venkat Ranganathan >Assignee: Junping Du > Fix For: 2.4.1 > > Attachments: YARN-2016.patch, YarnTest.java > > > When we query for the previous applications by creating an instance of > GetApplicationsRequest and setting the start time range and application tag, > we see that the start range provided is not honored and all applications with > the tag are returned. > Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
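The kind of set-then-get round-trip test proposed above can be sketched without any YARN dependencies. `SampleRequest` below is a hypothetical stand-in for a real PBImpl; a real test would additionally serialize to the proto and re-parse before asserting, which is exactly the merge step YARN-2016 showed was being dropped.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a PBImpl-style record with paired setters/getters.
class SampleRequest {
    private long startBegin;
    private long startEnd;
    private Set<String> tags = new HashSet<>();

    void setStartRange(long begin, long end) { startBegin = begin; startEnd = end; }
    long getStartBegin() { return startBegin; }
    long getStartEnd() { return startEnd; }
    void setApplicationTags(Set<String> t) { tags = t; }
    Set<String> getApplicationTags() { return tags; }
}

public class PBImplRoundTripSketch {
    public static void main(String[] args) {
        SampleRequest req = new SampleRequest();
        req.setStartRange(100L, 200L);
        Set<String> tags = new HashSet<>();
        tags.add("my-tag");
        req.setApplicationTags(tags);

        // Round-trip assertion: every field that was set must read back.
        // For a real PBImpl, build the proto and re-parse it first so a
        // field silently dropped in mergeLocalToBuilder() fails the test.
        boolean ok = req.getStartBegin() == 100L
                && req.getStartEnd() == 200L
                && req.getApplicationTags().contains("my-tag");
        System.out.println(ok ? "round-trip ok" : "round-trip FAILED");
    }
}
```

Applied mechanically to every PBImpl setter/getter pair, a test like this would have caught the dropped start-range field without needing a full RM.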
[jira] [Created] (YARN-2057) NPE in RM handling node update while app submission in progress
Steve Loughran created YARN-2057: Summary: NPE in RM handling node update while app submission in progress Key: YARN-2057 URL: https://issues.apache.org/jira/browse/YARN-2057 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test TestDestroyMasterlessAM Reporter: Steve Loughran One of our test runs finished prematurely with an NPE in the RM, followed by the RM thread calling system.exit(). It looks like an NM update came in while the app was still being set up, causing confusion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997266#comment-13997266 ] Gera Shegalov commented on YARN-1515: - Ok, I can work on CMP.signalContainer and replace stopContainers with signalContainer > Ability to dump the container threads and stop the containers in a single RPC > - > > Key: YARN-1515 > URL: https://issues.apache.org/jira/browse/YARN-1515 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, > YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, > YARN-1515.v06.patch, YARN-1515.v07.patch > > > This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for > timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-570: --- Component/s: webapp > Time strings are formated in different timezone > --- > > Key: YARN-570 > URL: https://issues.apache.org/jira/browse/YARN-570 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.2.0 >Reporter: Peng Zhang >Assignee: Akira AJISAKA > Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch > > > Time strings on different pages are displayed in different timezones. > If rendered by renderHadoopDate() in yarn.dt.plugins.js, a time appears as > "Wed, 10 Apr 2013 08:29:56 GMT". > If formatted by format() in yarn.util.Times, it appears as "10-Apr-2013 > 16:29:56". > Same value, but different timezones. -- This message was sent by Atlassian JIRA (v6.2#6252)
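The mismatch is easy to reproduce with stdlib formatting alone. The sketch below is an illustration, not the actual page code (which lives in yarn.dt.plugins.js and yarn.util.Times): it formats the single instant from the bug report once in GMT and once in a UTC+8 zone, yielding the two different strings.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// One timestamp, two formatters with different timezones: this reproduces
// the "Wed, 10 Apr 2013 08:29:56 GMT" vs "10-Apr-2013 16:29:56" mismatch.
public class TimezoneMismatchSketch {
    public static void main(String[] args) {
        long ts = 1365582596000L; // 10 Apr 2013 08:29:56 GMT, from the report

        SimpleDateFormat gmt = new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss", Locale.US);
        gmt.setTimeZone(TimeZone.getTimeZone("GMT"));

        SimpleDateFormat utc8 = new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss", Locale.US);
        utc8.setTimeZone(TimeZone.getTimeZone("Asia/Shanghai")); // UTC+8

        // Same instant, two rendered strings that differ by eight hours.
        System.out.println(gmt.format(new Date(ts)) + " vs " + utc8.format(new Date(ts)));
    }
}
```

The fix direction is simply to pin both renderers to one zone (or to the browser's local zone) so every page agrees.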
[jira] [Updated] (YARN-2053) Slider AM fails to restart
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2053: - Description: Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} was: Slider AppMaster restart fails with the following: {noformat} 14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 48058,address tracking URL=http://c6403.ambari.apache.org:48705 14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104) at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) Exception: java.lang.NullPointerException at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91) at org.apache.hadoop.yarn.proto.Ap
[jira] [Commented] (YARN-2053) Slider AM fails to restart
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997459#comment-13997459 ] Steve Loughran commented on YARN-2053: -- Note that this AM requests container retention over AM restarts -so is testing code paths that not much (anything?) else is testing > Slider AM fails to restart > -- > > Key: YARN-2053 > URL: https://issues.apache.org/jira/browse/YARN-2053 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sumit Mohanty > Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, > yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > > > Slider AppMaster restart fails with the following: > {noformat} > 14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at > 48058,address tracking URL=http://c6403.ambari.apache.org:48705 > 14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > at > 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > Exception: java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) > at > 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) > at > o
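The trace above shows the NPE originating in protobuf's {{AbstractMessageLite$Builder.checkForNullValues}}, which rejects null elements when {{mergeLocalToBuilder}} hands the NM-token list from previous attempts to {{addAllNmTokensFromPreviousAttempts}}. The following is a minimal, self-contained sketch of that failure mode and the obvious defensive fix (filter nulls before the builder sees them); the class and method names here are illustrative stand-ins, not the actual YARN or protobuf classes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the failure mode in the trace above: protobuf's
// Builder.addAll() walks the supplied iterable and throws
// NullPointerException on the first null element (checkForNullValues).
// Names are illustrative, not the real YARN/protobuf classes.
public class NmTokenMergeSketch {

    // Stand-in for AbstractMessageLite$Builder.addAll(): rejects nulls.
    static List<String> addAll(Iterable<String> values) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            if (v == null) {
                throw new NullPointerException("null element in token list");
            }
            out.add(v);
        }
        return out;
    }

    // Defensive merge, as a fix in mergeLocalToBuilder() might do:
    // drop null tokens before handing the list to the builder.
    static List<String> mergeTokens(List<String> localTokens) {
        List<String> nonNull = new ArrayList<>();
        for (String t : localTokens) {
            if (t != null) {
                nonNull.add(t);
            }
        }
        return addAll(nonNull);
    }
}
```

The real fix also has to answer why a null token ended up in the previous-attempt list in the first place (the container-retention path Steve mentions), but the null-rejection in the generated builder is what turns it into the NPE seen here.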
[jira] [Updated] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-1366: - Attachment: YARN-1366.2.patch Synced up offline with Anubhav about the doubts mentioned in the previous comment. I made changes in MapReduce as well as in AMRMClientImpl: 1. reset responseId to 0, 2. re-register with the RM, 3. add back all pending requests and update blacklisted nodes. > ApplicationMasterService should Resync with the AM upon allocate call after > restart > --- > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, > YARN-1366.prototype.patch, YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0, and the AM should send all of its outstanding requests to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed as normal without needing a resync. The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
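The three resync steps listed in the comment above can be sketched as follows. This is an illustrative stand-in for the client-side handling, not the real AMRMClientImpl; the field and method names are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the real AMRMClientImpl) of the resync steps in
// the comment above: on a resync signal from the RM, the client must
// (1) reset the allocate responseId to 0, (2) re-register with the RM, and
// (3) resend all outstanding resource requests and blacklist updates.
public class ResyncSketch {
    int responseId = 5;                 // pretend some allocates already happened
    boolean registered = true;
    final List<String> pendingAsks = new ArrayList<>();
    final List<String> blacklistedNodes = new ArrayList<>();
    final List<String> sentToRm = new ArrayList<>();

    void onResync() {
        responseId = 0;                 // step 1: reset allocate sequence number
        registered = register();        // step 2: re-register with the RM
        // step 3: add back every pending request and blacklisted node
        for (String ask : pendingAsks) sentToRm.add("ask:" + ask);
        for (String node : blacklistedNodes) sentToRm.add("blacklist:" + node);
    }

    boolean register() {
        // a real client would call registerApplicationMaster() here
        return true;
    }
}
```

Because the RM may report some container completions more than once after a resync (as the description notes), a real client also has to treat completion events idempotently.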
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997144#comment-13997144 ] Jian He commented on YARN-1368: --- New patch fixed Wangda's comments and also implemented the specific recover methods for FifoScheduler queue. This patch should be rebased on top of YARN-2017. > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Mohanty updated YARN-2053: Attachment: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > Slider AM fails to restart > -- > > Key: YARN-2053 > URL: https://issues.apache.org/jira/browse/YARN-2053 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sumit Mohanty > Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, > yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > > > Slider AppMaster restart fails with the following: > {noformat} > 14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at > 48058,address tracking URL=http://c6403.ambari.apache.org:48705 > 14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > at > 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > Exception: java.lang.NullPointerException > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) > at > 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.p
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1368: -- Attachment: YARN-1368.2.patch > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2053: - Summary: Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts (was: Slider AM fails to restart: NPE in ) > Slider AM fails to restart: NPE in > RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts > > > Key: YARN-2053 > URL: https://issues.apache.org/jira/browse/YARN-2053 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sumit Mohanty > Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, > yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > > > Slider AppMaster restart fails with the following: > {code} > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992900#comment-13992900 ] Chris Riccomini commented on YARN-2027: --- Dug into this a bit more. Not entirely convinced that the TreeSet stuff is actually an issue anymore. RMContainerRequestor.makeRemoteRequest calls: {code} allocateResponse = scheduler.allocate(allocateRequest); {code} If you drill down through the capacity scheduler, into SchedulerApplicationAttempt and AppSchedulingInfo, you'll eventually see that AppSchedulingInfo.updateResourceRequests simply adds the items in "ask" into a map based on priority. The order in which these asks come in seem to always be with ANY first (see above), so updatePendingResources will always be true, but this doesn't seem harmful. Anyway, any ideas why YARN is ignoring host requests? > YARN ignores host-specific resource requests > > > Key: YARN-2027 > URL: https://issues.apache.org/jira/browse/YARN-2027 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.4.0 > Environment: RHEL 6.1 > YARN 2.4 >Reporter: Chris Riccomini > > YARN appears to be ignoring host-level ContainerRequests. > I am creating a container request with code that pretty closely mirrors the > DistributedShell code: > {code} > protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) > { > info("Requesting %d container(s) with %dmb of memory" format (containers, > memMb)) > val capability = Records.newRecord(classOf[Resource]) > val priority = Records.newRecord(classOf[Priority]) > priority.setPriority(0) > capability.setMemory(memMb) > capability.setVirtualCores(cpuCores) > // Specifying a host in the String[] host parameter here seems to do > nothing. Setting relaxLocality to false also doesn't help. 
> (0 until containers).foreach(idx => amClient.addContainerRequest(new > ContainerRequest(capability, null, null, priority))) > } > {code} > When I run this code with a specific host in the ContainerRequest, YARN does > not honor the request. Instead, it puts the container on an arbitrary host. > This appears to be true for both the FifoScheduler and the CapacityScheduler. > Currently, we are running the CapacityScheduler with the following settings: > {noformat} > > > yarn.scheduler.capacity.maximum-applications > 1 > > Maximum number of applications that can be pending and running. > > > > yarn.scheduler.capacity.maximum-am-resource-percent > 0.1 > > Maximum percent of resources in the cluster which can be used to run > application masters i.e. controls number of concurrent running > applications. > > > > yarn.scheduler.capacity.resource-calculator > > org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator > > The ResourceCalculator implementation to be used to compare > Resources in the scheduler. > The default i.e. DefaultResourceCalculator only uses Memory while > DominantResourceCalculator uses dominant-resource to compare > multi-dimensional resources such as Memory, CPU etc. > > > > yarn.scheduler.capacity.root.queues > default > > The queues at the this level (root is the root queue). > > > > yarn.scheduler.capacity.root.default.capacity > 100 > Samza queue target capacity. > > > yarn.scheduler.capacity.root.default.user-limit-factor > 1 > > Default queue user limit a percentage from 0.0 to 1.0. > > > > yarn.scheduler.capacity.root.default.maximum-capacity > 100 > > The maximum capacity of the default queue. > > > > yarn.scheduler.capacity.root.default.state > RUNNING > > The state of the default queue. State can be one of RUNNING or STOPPED. > > > > yarn.scheduler.capacity.root.default.acl_submit_applications > * > > The ACL of who can submit jobs to the default queue. 
> > > > yarn.scheduler.capacity.root.default.acl_administer_queue > * > > The ACL of who can administer jobs on the default queue. > > > > yarn.scheduler.capacity.node-locality-delay > 40 > > Number of missed scheduling opportunities after which the > CapacityScheduler > attempts to schedule rack-local containers. > Typically this should be set to number of nodes in the cluster, By > default is setting > approximately number of nodes in one rack which is 40. > > > > {noformat} > Digging into the code a bit (props to [~jghoman] for finding this), we have a > theory as to why this is happening. It looks lik
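The node-locality-delay setting shown in the configuration above (40) governs how many missed scheduling opportunities the CapacityScheduler tolerates before it relaxes a node-local ask to a rack-local (and eventually off-switch) placement, which is one way a host-specific request can legitimately land elsewhere. A simplified sketch of that counter logic, with stand-in names rather than the actual scheduler classes:

```java
// Illustrative sketch of the CapacityScheduler's node-locality-delay logic
// referenced in the config above: a node-local ask is only relaxed to a
// rack-local placement after the request has been skipped for more than
// `nodeLocalityDelay` scheduling opportunities. Names are simplified
// stand-ins, not the actual scheduler classes.
public class LocalityDelaySketch {
    final int nodeLocalityDelay;   // e.g. 40, per the capacity-scheduler.xml above
    int missedOpportunities = 0;

    LocalityDelaySketch(int nodeLocalityDelay) {
        this.nodeLocalityDelay = nodeLocalityDelay;
    }

    // Called once per heartbeat from a node that cannot satisfy the
    // node-local ask; returns true when the scheduler may relax locality.
    boolean canRelaxToRackLocal() {
        missedOpportunities++;
        return missedOpportunities > nodeLocalityDelay;
    }
}
```

Note, however, that this delay only explains off-host placement when relaxLocality is in effect; the snippet in the report passes null for the hosts and racks arrays, so no host preference ever reaches the scheduler in the first place.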
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2056: Description: We need to be able to disable preemption at individual queue level (was: If Queue A does not have enough capacity to run AM, then AM will borrow capacity from queue B to run AM in that case AM will be killed if queue B will reclaim its capacity and again AM will be launched and killed again, in that case job will be failed.) > Disable preemption at Queue level > - > > Key: YARN-2056 > URL: https://issues.apache.org/jira/browse/YARN-2056 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal > Fix For: 2.1.0-beta > > > We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
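A per-queue preemption switch of the kind this issue asks for would naturally live in capacity-scheduler.xml alongside the other queue properties shown elsewhere in this digest. The property name below is an assumption about how such a switch could look; this issue predates any committed configuration key.

```xml
<!-- Illustrative only: a per-queue preemption kill switch as it might look
     in capacity-scheduler.xml. The name "disable_preemption" is an
     assumption, not a committed property. -->
<property>
  <name>yarn.scheduler.capacity.root.default.disable_preemption</name>
  <value>true</value>
  <description>
    When true, containers in the root.default queue are exempt from
    cross-queue preemption by the ProportionalCapacityPreemptionPolicy.
  </description>
</property>
```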
[jira] [Commented] (YARN-1302) Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996627#comment-13996627 ] Zhijie Shen commented on YARN-1302: --- Anyway, leave it open to see whether it is required to expose the DT access via ApplicationHistoryProtocol as well. > Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol > -- > > Key: YARN-1302 > URL: https://issues.apache.org/jira/browse/YARN-1302 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Like the ApplicationClientProtocol, ApplicationHistoryProtocol needs its own > security stack. We need to implement AHSDelegationTokenSecretManager, > AHSDelegationTokenIndentifier, AHSDelegationTokenSelector and other analogs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2053: - Summary: Slider AM fails to restart: NPE in (was: Slider AM fails to restart) > Slider AM fails to restart: NPE in > --- > > Key: YARN-2053 > URL: https://issues.apache.org/jira/browse/YARN-2053 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sumit Mohanty > Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, > yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak > > > Slider AppMaster restart fails with the following: > {code} > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997461#comment-13997461 ] Steve Loughran commented on YARN-2053: -- {code} {noformat} 14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 48058,address tracking URL=http://c6403.ambari.apache.org:48705 14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75) at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) Exception: java.lang.NullPointerException at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344) at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at or
[jira] [Updated] (YARN-1049) ContainerExistStatus should define a status for preempted containers
[ https://issues.apache.org/jira/browse/YARN-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1049: -- Issue Type: Sub-task (was: Bug) Parent: YARN-45 > ContainerExistStatus should define a status for preempted containers > > > Key: YARN-1049 > URL: https://issues.apache.org/jira/browse/YARN-1049 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Affects Versions: 2.1.0-beta >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur >Priority: Blocker > Fix For: 2.1.1-beta > > Attachments: YARN-1049.patch > > > With the current behavior is impossible to determine if a container has been > preempted or lost due to a NM crash. > Adding a PREEMPTED exit status (-102) will help an AM determine that a > container has been preempted. > Note the change of scope from the original summary/description. The original > scope proposed API/behavior changes. Because we are passed 2.1.0-beta I'm > reducing the scope of this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997328#comment-13997328 ] Sunil G commented on YARN-2055: --- Hi Mayank, Is this issue same as YARN-2022 ? > Preemption: Jobs are failing due to AMs are getting launched and killed > multiple times > -- > > Key: YARN-2055 > URL: https://issues.apache.org/jira/browse/YARN-2055 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal > Fix For: 2.1.0-beta > > > If Queue A does not have enough capacity to run AM, then AM will borrow > capacity from queue B to run AM in that case AM will be killed if queue B > will reclaim its capacity and again AM will be launched and killed again, in > that case job will be failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1408: -- Issue Type: Sub-task (was: Bug) Parent: YARN-45 > Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task > timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Sunil G >Assignee: Sunil G > Fix For: 2.5.0 > > Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, > Yarn-1408.4.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows. > * yarn.resourcemanager.scheduler.monitor.enable= true , > * > yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queue = a,b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Assign a big jobA on queue a which uses full cluster capacity > Step 2: Submitted a jobB to queue b which would use less than 20% of cluster > capacity > JobA task which uses queue b capcity is been preempted and killed. > This caused below problem: > 1. New Container has got allocated for jobA in Queue A as per node update > from an NM. > 2. This container has been preempted immediately as per preemption. > Here ACQUIRED at KILLED Invalid State exception came when the next AM > heartbeat reached RM. > ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > ACQUIRED at KILLED > This also caused the Task to go for a timeout for 30minutes as this Container > was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
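The "Invalid event: ACQUIRED at KILLED" error above means the container state machine has no transition registered for that (state, event) pair, so a late ACQUIRED arriving after preemption throws instead of being ignored. The usual fix pattern is to register the late event as a no-op transition. A minimal, self-contained sketch of that pattern (a simplified stand-in, not the real RMContainerImpl state machine):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the transition-table pattern behind the error above:
// "Invalid event: ACQUIRED at KILLED" means no transition is registered for
// that (state, event) pair. A common fix is to register the late event as a
// no-op so heartbeats racing with a preemption kill are tolerated.
// Simplified stand-in, not the real RMContainerImpl state machine.
public class ContainerStateMachineSketch {
    enum State { ALLOCATED, ACQUIRED, KILLED }
    enum Event { ACQUIRED, KILL }

    private final Map<String, State> transitions = new HashMap<>();
    State current = State.ALLOCATED;

    ContainerStateMachineSketch() {
        transitions.put(key(State.ALLOCATED, Event.ACQUIRED), State.ACQUIRED);
        transitions.put(key(State.ALLOCATED, Event.KILL), State.KILLED);
        transitions.put(key(State.ACQUIRED, Event.KILL), State.KILLED);
        // The fix: tolerate a late ACQUIRED after the container was killed,
        // staying in KILLED instead of throwing.
        transitions.put(key(State.KILLED, Event.ACQUIRED), State.KILLED);
    }

    void handle(Event e) {
        State next = transitions.get(key(current, e));
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + e + " at " + current);
        }
        current = next;
    }

    private static String key(State s, Event e) { return s + "/" + e; }
}
```

Ignoring the late event on the RM side addresses the exception; the 30-minute task timeout is the separate consequence of the AM never learning that its freshly allocated container was already gone.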
[jira] [Created] (YARN-2053) Slider AM fails to restart
Sumit Mohanty created YARN-2053: --- Summary: Slider AM fails to restart Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Slider AppMaster restart fails with the following:
{noformat}
14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 48058,address tracking URL=http://c6403.ambari.apache.org:48705
14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
	at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
	at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
	at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
	at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Exception: java.lang.NullPointerException
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
	at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
	at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
	at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
	at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.c
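The trace shows the NPE originating in protobuf's AbstractMessageLite.Builder.addAll, which walks the supplied iterable and throws on the first null element. Below is a minimal, self-contained Java sketch of that failure mode and a defensive null-filtering fix; the NullSafeAddAll class, its addAll stand-in, and the withoutNulls helper are hypothetical illustrations, not Hadoop or protobuf code, and the actual YARN-2053 patch may fix the bug differently (e.g., by not producing null tokens in the first place).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullSafeAddAll {
    // Stand-in for a protobuf builder's addAll(): like
    // AbstractMessageLite.Builder.checkForNullValues, it throws
    // NullPointerException on any null element in the input.
    static <T> List<T> addAll(List<T> target, Iterable<T> values) {
        for (T v : values) {
            if (v == null) {
                throw new NullPointerException("null element in addAll input");
            }
            target.add(v);
        }
        return target;
    }

    // Defensive fix sketch: drop null entries before handing the collection
    // to the builder, so a list with holes (e.g., missing NM tokens from a
    // previous attempt) cannot crash the register-response serialization.
    static <T> List<T> withoutNulls(Iterable<T> values) {
        List<T> out = new ArrayList<>();
        for (T v : values) {
            if (v != null) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("nm1-token", null, "nm2-token");
        // Passing `tokens` directly to addAll would throw NPE; filtering first is safe.
        List<String> safe = addAll(new ArrayList<>(), withoutNulls(tokens));
        System.out.println(safe.size()); // 2
    }
}
```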
[jira] [Commented] (YARN-1981) Nodemanager version is not updated when a node reconnects
[ https://issues.apache.org/jira/browse/YARN-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996825#comment-13996825 ] Jonathan Eagles commented on YARN-1981: --- +1. lgtm. Committing to branch-2 and trunk. Thanks, [~jlowe]. > Nodemanager version is not updated when a node reconnects > - > > Key: YARN-1981 > URL: https://issues.apache.org/jira/browse/YARN-1981 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1981.patch > > > When a nodemanager is quickly restarted and happens to change versions during > the restart (e.g.: rolling upgrade scenario) the NM version as reported by > the RM is not updated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1809: -- Target Version/s: 2.5.0 > Synchronize RM and Generic History Service Web-UIs > -- > > Key: YARN-1809 > URL: https://issues.apache.org/jira/browse/YARN-1809 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, > YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch, YARN-1809.6.patch, > YARN-1809.7.patch, YARN-1809.8.patch, YARN-1809.9.patch > > > After YARN-953, the web-UI of the generic history service provides more > information than that of the RM, namely the details about app attempts and > containers. It's good to provide similar web-UIs that retrieve the data from > separate sources, i.e., the RM cache and the history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
[ https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996965#comment-13996965 ] Jason Lowe commented on YARN-2014: -- HADOOP-7549 added service loading of filesystems, and HADOOP-7350 added service loading of compression codecs. I'll see if I have some time to disable the service loading of unnecessary classes. > Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9 > > > Key: YARN-2014 > URL: https://issues.apache.org/jira/browse/YARN-2014 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: patrick white >Assignee: Jason Lowe > > Performance comparison benchmarks from 2.x against 0.23 shows AM scalability > benchmark's runtime is approximately 10% slower in 2.4.0. The trend is > consistent across later releases in both lines, latest release numbers are: > 2.4.0.0 runtime 255.6 seconds (avg 5 passes) > 0.23.9.12 runtime 230.4 seconds (avg 5 passes) > Diff: -9.9% > AM Scalability test is essentially a sleep job that measures time to launch > and complete a large number of mappers. > The diff is consistent and has been reproduced in both a larger (350 node, > 100,000 mappers) perf environment, as well as a small (10 node, 2,900 > mappers) demo cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2055: -- Summary: Preemption: Jobs are failing due to AMs are getting launched and killed multiple times (was: Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times) > Preemption: Jobs are failing due to AMs are getting launched and killed > multiple times > -- > > Key: YARN-2055 > URL: https://issues.apache.org/jira/browse/YARN-2055 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal > Fix For: 2.1.0-beta > > > If Queue A does not have enough capacity to run its AM, the AM will borrow > capacity from Queue B. In that case the AM will be killed when Queue B > reclaims its capacity, then launched and killed again, and the job will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997351#comment-13997351 ] Sunil G commented on YARN-2022: --- Thank you Carlo for the clarifications on am-priority and user-limit-factor. I agree with your point about possible tampering with container priority 0. On this point, I feel your option 1 (track which container is the AM, not via Priority) may be more suitable. Even with option 2, the AM container has to be found first among multiple containers at Priority=0; saving the AM first and then saving as many other containers as possible may not work well when many applications are marked for preemption. When an AM container is launched, the RM has to mark it as an AM container. CapacityScheduler has RMContext, and from that, with the ApplicationAttemptID, we can get the MasterContainer, but I feel this is a slightly complex look-up. It would be better to set a property directly on the container to mark it as the MasterContainer. Also, with user-limit-factor and max-user-percentage, it is not good if the scheduler keeps skipping containers and such an AM has to keep asking for containers again; if this AM is a "saved AM" from preemption, it is even worse. For this too we can place a checkpoint decision on whether to save it or not. So to summarize roughly: 1) A better marking for finding the AM container is needed. [We can see whether this can be extended to save multiple low-priority containers as well.] 2) A checkpoint has to be derived to decide whether to save an AM, based on the following factors: a. the max-am-percentage limit has to be honored; b. user-limit-factor or max-user-percentage also has to be checked. I can first try to post a design approach on deriving the checkpoint decision from both a. and b. above. Please share more thoughts if any on this. 
> Preempting an Application Master container can be kept as least priority when > multiple applications are marked for preemption by > ProportionalCapacityPreemptionPolicy > - > > Key: YARN-2022 > URL: https://issues.apache.org/jira/browse/YARN-2022 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: Yarn-2022.1.patch > > > Cluster Size = 16GB [2NM's] > Queue A Capacity = 50% > Queue B Capacity = 50% > Consider there are 3 applications running in Queue A which has taken the full > cluster capacity. > J1 = 2GB AM + 1GB * 4 Maps > J2 = 2GB AM + 1GB * 4 Maps > J3 = 2GB AM + 1GB * 2 Maps > Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. > Currently in this scenario, Jobs J3 will get killed including its AM. > It is better if AM can be given least priority among multiple applications. > In this same scenario, map tasks from J3 and J2 can be preempted. > Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
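The option discussed in the comment above, i.e. marking the AM container explicitly instead of inferring it from Priority=0, can be sketched as follows. This is a hypothetical illustration only: the Container class, its isAMContainer flag, and preemptionOrder are invented names for this sketch, not YARN APIs, and the real preemption policy weighs many more factors than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PreemptionOrderSketch {
    // Hypothetical container record: an explicit isAMContainer flag set by
    // the RM at launch time. Unlike inferring "AM" from Priority == 0, an
    // application cannot spoof this by requesting priority 0 for ordinary
    // task containers.
    static class Container {
        final String id;
        final boolean isAMContainer;

        Container(String id, boolean isAMContainer) {
            this.id = id;
            this.isAMContainer = isAMContainer;
        }
    }

    // Sketch: order preemption candidates so AM containers are taken last.
    // The sort is stable, so the relative order of task containers is kept.
    static List<Container> preemptionOrder(List<Container> candidates) {
        List<Container> out = new ArrayList<>(candidates);
        out.sort(Comparator.comparing((Container c) -> c.isAMContainer)); // tasks first
        return out;
    }

    public static void main(String[] args) {
        List<Container> candidates = List.of(
                new Container("c1-am", true),
                new Container("c2-map", false),
                new Container("c3-map", false));
        // The AM container lands at the end of the preemption order.
        System.out.println(preemptionOrder(candidates).get(2).id); // c1-am
    }
}
```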
[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()
[ https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997334#comment-13997334 ] Sandy Ryza commented on YARN-2042: -- +1 > String shouldn't be compared using == in > QueuePlacementRule#NestedUserQueue#getQueueForApp() > > > Key: YARN-2042 > URL: https://issues.apache.org/jira/browse/YARN-2042 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ted Yu >Assignee: Chen He >Priority: Minor > Attachments: YARN-2042.patch > > > {code} > if (queueName != null && queueName != "") { > {code} > queueName.isEmpty() should be used instead of comparing against "" -- This message was sent by Atlassian JIRA (v6.2#6252)
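The bug reported in YARN-2042 is the classic Java pitfall that `==` compares object identity, not string contents: a literal `""` and an empty string built at runtime are distinct objects, so `queueName != ""` is true even when the name is empty. A small self-contained sketch (the isNonEmpty helper is a hypothetical stand-in for the check inside getQueueForApp, not the actual patch):

```java
public class StringCompareDemo {
    // Hypothetical helper mirroring the suggested fix: null-check plus
    // isEmpty(), instead of comparing the string against "" with !=.
    static boolean isNonEmpty(String queueName) {
        return queueName != null && !queueName.isEmpty();
    }

    public static void main(String[] args) {
        // StringBuilder.toString() returns a fresh String object, so it is
        // never the same object as the interned literal "".
        String runtimeEmpty = new StringBuilder().toString();
        System.out.println(runtimeEmpty != "");        // true: identity differs
        System.out.println(isNonEmpty(runtimeEmpty));  // false: content check is correct
        System.out.println(isNonEmpty("root.alice"));  // true
    }
}
```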
[jira] [Updated] (YARN-1986) In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE
[ https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1986: - Assignee: Hong Zhiguo (was: Sandy Ryza) > In Fifo Scheduler, node heartbeat in between creating app and attempt causes > NPE > > > Key: YARN-1986 > URL: https://issues.apache.org/jira/browse/YARN-1986 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Jon Bringhurst >Assignee: Hong Zhiguo >Priority: Critical > Attachments: YARN-1986-2.patch, YARN-1986-3.patch, > YARN-1986-testcase.patch, YARN-1986.patch > > > After upgrade from 2.2.0 to 2.4.0, NPE on first job start. > -After RM was restarted, the job runs without a problem.- > {noformat} > 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type > NODE_UPDATE to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) > at java.lang.Thread.run(Thread.java:744) > 19:11:13,443 INFO ResourceManager:604 - Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
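The YARN-1986 race is that a NODE_UPDATE (heartbeat) event can be processed after the application is registered but before its first attempt exists, so the scheduler dereferences a null attempt. A simplified Java sketch of the race and the kind of null guard that avoids it; the App/Attempt classes and assignContainers signature here are hypothetical stand-ins, not the actual FifoScheduler code or the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FifoGuardSketch {
    // Simplified stand-ins: the real scheduler tracks application objects
    // whose current attempt is attached later by a separate ATTEMPT_ADDED event.
    static class Attempt {
        int assignedContainers;
    }

    static class App {
        Attempt currentAttempt; // null until ATTEMPT_ADDED is handled
    }

    // Defensive assignContainers(): skip applications whose attempt does not
    // exist yet instead of dereferencing null, so a node heartbeat landing
    // between APP_ADDED and ATTEMPT_ADDED is harmless.
    static int assignContainers(Map<String, App> apps) {
        int assigned = 0;
        for (App app : apps.values()) {
            Attempt attempt = app.currentAttempt;
            if (attempt == null) {
                continue; // app registered, attempt not created yet: skip
            }
            attempt.assignedContainers++;
            assigned++;
        }
        return assigned;
    }

    public static void main(String[] args) {
        Map<String, App> apps = new LinkedHashMap<>();
        App app = new App();
        apps.put("application_0001", app);           // APP_ADDED handled
        System.out.println(assignContainers(apps));  // 0: heartbeat raced ahead, no NPE
        app.currentAttempt = new Attempt();          // ATTEMPT_ADDED handled
        System.out.println(assignContainers(apps));  // 1
    }
}
```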
[jira] [Created] (YARN-2055) Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times
Mayank Bansal created YARN-2055: --- Summary: Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Sunil G Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)