[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval

2014-05-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998337#comment-13998337
 ] 

Sandy Ryza commented on YARN-2054:
--

+1

> Poor defaults for YARN ZK configs for retries and retry-interval
> ---
>
> Key: YARN-2054
> URL: https://issues.apache.org/jira/browse/YARN-2054
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
> Attachments: yarn-2054-1.patch
>
>
> Currently, we have the following default values:
> # yarn.resourcemanager.zk-num-retries - 500
> # yarn.resourcemanager.zk-retry-interval-ms - 2000
> This leads to a cumulative 1000 seconds before the RM gives up trying to 
> connect to the ZK. 
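For reference, those defaults multiply out to 500 retries * 2000 ms = 1,000,000 ms, 
i.e. roughly 1000 seconds (about 17 minutes). A yarn-site.xml sketch for overriding 
the two properties named above; the values here are purely illustrative, not the 
ones the patch settles on:

{code}
<property>
  <name>yarn.resourcemanager.zk-num-retries</name>
  <value>30</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-retry-interval-ms</name>
  <value>1000</value>
</property>
{code}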



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts

2014-05-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998324#comment-13998324
 ] 

Wangda Tan commented on YARN-2053:
--

Sure, I'll do that, thanks for review!

> Slider AM fails to restart: NPE in 
> RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
> 
>
> Key: YARN-2053
> URL: https://issues.apache.org/jira/browse/YARN-2053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sumit Mohanty
>Assignee: Wangda Tan
> Attachments: YARN-2053.patch, 
> yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, 
> yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak
>
>
> Slider AppMaster restart fails with the following:
> {code}
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover

2014-05-14 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-2062:
--

 Summary: Too many InvalidStateTransitionExceptions from 
NodeState.NEW on RM failover
 Key: YARN-2062
 URL: https://issues.apache.org/jira/browse/YARN-2062
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla


On busy clusters, we see several 
{{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events 
invoked against NEW nodes. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1975) Used resources shows escaped html in CapacityScheduler and FairScheduler page

2014-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998172#comment-13998172
 ] 

Hudson commented on YARN-1975:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1779 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1779/])
YARN-1975. Fix yarn application CLI to print the scheme of the tracking url of 
failed/killed applications. Contributed by Junping Du (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1593874)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java


> Used resources shows escaped html in CapacityScheduler and FairScheduler page
> -
>
> Key: YARN-1975
> URL: https://issues.apache.org/jira/browse/YARN-1975
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0, 2.4.0
>Reporter: Nathan Roberts
>Assignee: Mit Desai
> Fix For: 3.0.0, 2.4.1
>
> Attachments: YARN-1975.patch, screenshot-1975.png
>
>
> Used resources displays as <memory:, vCores;> with capacity 
> scheduler



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2061) Revisit logging levels in ZKRMStateStore

2014-05-14 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-2061:
--

 Summary: Revisit logging levels in ZKRMStateStore 
 Key: YARN-2061
 URL: https://issues.apache.org/jira/browse/YARN-2061
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Ray Chiang
Priority: Minor


ZKRMStateStore has a few places where it is logging at the INFO level. We 
should change these to DEBUG or TRACE level messages.
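A minimal sketch of the usual commons-logging pattern such a change implies 
(illustrative only; which statements to demote is left to the patch):

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ZkLoggingExample {
  private static final Log LOG = LogFactory.getLog(ZkLoggingExample.class);

  void onZkOperation(String path) {
    // Demote routine per-operation messages from INFO to DEBUG, and guard them
    // so the string concatenation is skipped when DEBUG logging is off.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processing ZK path: " + path);
    }
  }
}
{code}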



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts

2014-05-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997804#comment-13997804
 ] 

Jian He commented on YARN-2053:
---

Thanks for catching this! The patch looks good to me. Wangda, can you add a test 
case? Basically, allocating 2 containers on the same node in 
TestAMRestart#testNMTokensRebindOnAMRestart should be enough.

> Slider AM fails to restart: NPE in 
> RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
> 
>
> Key: YARN-2053
> URL: https://issues.apache.org/jira/browse/YARN-2053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sumit Mohanty
>Assignee: Wangda Tan
> Attachments: YARN-2053.patch, 
> yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, 
> yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak
>
>
> Slider AppMaster restart fails with the following:
> {code}
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN

2014-05-14 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created YARN-2041:
--

 Summary: Hard to co-locate MR2 and Spark jobs on the same cluster 
in YARN
 Key: YARN-2041
 URL: https://issues.apache.org/jira/browse/YARN-2041
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Nishkam Ravi
 Fix For: 2.4.0, 2.3.0


Performance of MR2 jobs falls drastically as YARN config parameter 
yarn.nodemanager.resource.memory-mb  is increased beyond a certain value. 

Performance of Spark falls drastically as the value of 
yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a 
large data set.

This makes it hard to co-locate MR2 and Spark jobs in YARN.

The experiments are being conducted on a 6-node cluster. The following 
workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, 
ShuffleText and PageRank.

Will add more details to this JIRA over time.
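For readers unfamiliar with the knob being discussed, it is the single NodeManager 
setting below, shown as a yarn-site.xml sketch (the value is purely illustrative, 
not a recommendation from this JIRA):

{code}
<property>
  <!-- Total memory, in MB, that this NodeManager advertises to the RM
       for running containers. -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
{code}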




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997132#comment-13997132
 ] 

Jian He commented on YARN-1368:
---

New patch uploaded.

> Common work to re-populate containers’ state into scheduler
> ---
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Jian He
> Attachments: YARN-1368.1.patch, YARN-1368.2.patch, 
> YARN-1368.combined.001.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1976) Tracking url missing http protocol for FAILED application

2014-05-14 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994158#comment-13994158
 ] 

Junping Du commented on YARN-1976:
--

Hi [~jianhe], would you mind reviewing it again? Thanks!

> Tracking url missing http protocol for FAILED application
> -
>
> Key: YARN-1976
> URL: https://issues.apache.org/jira/browse/YARN-1976
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Junping Du
> Attachments: YARN-1976-v2.patch, YARN-1976.patch
>
>
> Run yarn application -list -appStates FAILED: it does not print the http 
> protocol (scheme) in the tracking URL the way it does for FINISHED apps.
> {noformat}
> -bash-4.1$ yarn application -list -appStates FINISHED,FAILED,KILLED
> 14/04/15 23:55:07 INFO client.RMProxy: Connecting to ResourceManager at host
> Total number of applications (application-types: [] and states: [FINISHED, 
> FAILED, KILLED]):4
> Application-Id                  Application-Name  Application-Type  User    Queue    State     Final-State  Progress  Tracking-URL
> application_1397598467870_0004  Sleep job         MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0004
> application_1397598467870_0003  Sleep job         MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0003
> application_1397598467870_0002  Sleep job         MAPREDUCE         hrt_qa  default  FAILED    FAILED       100%      host:8088/cluster/app/application_1397598467870_0002
> application_1397598467870_0001  word count        MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0001
> {noformat}
> It only prints 'host:8088/cluster/app/application_1397598467870_0002' instead of 
> 'http://host:8088/cluster/app/application_1397598467870_0002'.
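A hedged sketch of the kind of normalization being requested, assuming an 
ApplicationReport named {{report}} (illustrative; not necessarily how the attached 
patch implements it):

{code}
String trackingUrl = report.getTrackingUrl();  // e.g. "host:8088/cluster/app/..."
if (trackingUrl != null && !trackingUrl.isEmpty()
    && !trackingUrl.contains("://")) {
  // FAILED/KILLED apps should print a scheme just like FINISHED ones do.
  trackingUrl = "http://" + trackingUrl;
}
{code}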



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-05-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992711#comment-13992711
 ] 

Hadoop QA commented on YARN-2022:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643727/Yarn-2022.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3718//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3718//console

This message is automatically generated.

> Preempting an Application Master container can be kept as least priority when 
> multiple applications are marked for preemption by 
> ProportionalCapacityPreemptionPolicy
> -
>
> Key: YARN-2022
> URL: https://issues.apache.org/jira/browse/YARN-2022
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Yarn-2022.1.patch
>
>
> Cluster Size = 16GB [2NM's]
> Queue A Capacity = 50%
> Queue B Capacity = 50%
> Consider there are 3 applications running in Queue A which has taken the full 
> cluster capacity. 
> J1 = 2GB AM + 1GB * 4 Maps
> J2 = 2GB AM + 1GB * 4 Maps
> J3 = 2GB AM + 1GB * 2 Maps
> Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
> Currently in this scenario, Job J3 will get killed, including its AM.
> It would be better if the AM could be given the least priority among multiple 
> applications; in this same scenario, map tasks from J3 and J2 could be 
> preempted instead.
> Later, when the cluster is free, maps can be allocated to these jobs.
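For reference, a quick tally of the scenario above using the listed sizes:

{noformat}
J1 = 2 GB AM + 4 x 1 GB maps = 6 GB
J2 = 2 GB AM + 4 x 1 GB maps = 6 GB
J3 = 2 GB AM + 2 x 1 GB maps = 4 GB
Total = 16 GB, the full cluster, which is why J4 in Queue B triggers preemption.
{noformat}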



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2057) NPE in RM handling node update while app submission in progress

2014-05-14 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved YARN-2057.
--

Resolution: Duplicate

> NPE in RM handling node update while app submission in progress
> ---
>
> Key: YARN-2057
> URL: https://issues.apache.org/jira/browse/YARN-2057
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test 
> TestDestroyMasterlessAM
>Reporter: Steve Loughran
>
> One of our test runs finished prematurely with an NPE in the RM, followed by 
> the RM thread calling system.exit(). It looks like an NM update came in while 
> the app was still being set up, causing confusion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2057) NPE in RM handling node update while app submission in progress

2014-05-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997762#comment-13997762
 ] 

Steve Loughran commented on YARN-2057:
--

You're right, the stack trace and scenario ("heartbeat during app launch") seem 
to match; closing as a duplicate.

> NPE in RM handling node update while app submission in progress
> ---
>
> Key: YARN-2057
> URL: https://issues.apache.org/jira/browse/YARN-2057
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test 
> TestDestroyMasterlessAM
>Reporter: Steve Loughran
>
> One of our test runs finished prematurely with an NPE in the RM, followed by 
> the RM thread calling system.exit(). It looks like an NM update came in while 
> the app was still being set up, causing confusion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2014-05-14 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2055:


Description: If Queue A does not have enough capacity to run an AM, the AM 
will borrow capacity from Queue B. In that case the AM will be killed when 
Queue B reclaims its capacity, then launched and killed again, and the job 
will eventually fail.  (was: Cluster Size = 16GB [2NM's]
Queue A Capacity = 50%
Queue B Capacity = 50%
Consider there are 3 applications running in Queue A which has taken the full 
cluster capacity. 
J1 = 2GB AM + 1GB * 4 Maps
J2 = 2GB AM + 1GB * 4 Maps
J3 = 2GB AM + 1GB * 2 Maps

Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
Currently in this scenario, Jobs J3 will get killed including its AM.

It is better if AM can be given least priority among multiple applications. In 
this same scenario, map tasks from J3 and J2 can be preempted.
Later when cluster is free, maps can be allocated to these Jobs.)

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> -
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
> Fix For: 2.1.0-beta
>
>
> If Queue A does not have enough capacity to run an AM, the AM will borrow 
> capacity from Queue B. In that case the AM will be killed when Queue B 
> reclaims its capacity, then launched and killed again, and the job will 
> eventually fail.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997386#comment-13997386
 ] 

Wangda Tan commented on YARN-1368:
--

Hi Jian,
Thanks for updating your patch. I took a look at it; some comments:
1) RMAppImpl.java
{code}
+  // Add the current attempt to the scheduler.It'll be removed from
+  // scheduler in RMAppAttempt#BaseFinalTransition
+  app.handler.handle(new AppAttemptAddedSchedulerEvent(app.currentAttempt
+.getAppAttemptId(), false));
{code}
I don't quite understand this. In the current trunk code, how does RMAppAttempt 
notify the scheduler about "AppAttemptAddedSchedulerEvent" when recovering? If 
this is a missing piece in current trunk, I would rather put the code that sends 
"AppAttemptAddedSchedulerEvent" into RMAppAttemptImpl.

2) RMAppAttemptImpl.java
{code}
+   new ContainerFinishedTransition(
+  new AMContainerCrashedAtRunningTransition(),
+  RMAppAttemptState.RUNNING))
{code}
And
{code}
+   new ContainerFinishedTransition(
+  new AMContainerCrashedBeforeRunningTransition(),
+  RMAppAttemptState.LAUNCHED))
{code}
I found the "RUNNING" and "LAUNCHED" state are passed in as targetedFinalState, 
and the targetFinalState will only used in FinalStateSavedTransition, I got 
confused, could you please elaborate on this? Why use split 
AMContainerCrashedTransition to two transitions and set their states to 
RUNNING/LAUNCHED differently.

3) In AbstractYarnScheduler.java
3.1 
{code}
+  if (rmApp == null) {
+LOG.error("Skip recovering container " + status
++ " for unknown application.");
+continue;
+  }
{code}
And
{code}
+  if (rmApp.getApplicationSubmissionContext().getUnmanagedAM()) {
+if (LOG.isDebugEnabled()) {
+  LOG.debug("Skip recovering container " + status
+  + " for unmanaged AM." + rmApp.getApplicationId());
+}
+continue;
+  }
{code}
And
{code}
+  SchedulerApplication schedulerApp = applications.get(appId);
+  if (schedulerApp == null) {
+LOG.info("Skip recovering container  " + status
++ " for unknown SchedulerApplication. Application state is "
++ rmApp.getState());
+continue;
+  }
{code}
It would be better to make the log levels more consistent.

3.2
{code}
+  public RMContainer createContainer(ContainerStatus status, RMNode node) {
+Container container =
+Container.newInstance(status.getContainerId(), node.getNodeID(),
+  node.getHttpAddress(), Resource.newInstance(1024, 1),
+  Priority.newInstance(0), null);
{code}
Should we change Resource(1024, 1) to the container's actual resource?

3.3 For recoverContainersOnNode, is it possible for NODE_ADDED to happen before 
APP_ADDED? I ask because, if so, a container may be recovered before its 
application is added to the scheduler.

4) In FiCaSchedulerNode.java:
{code}
+  @Override
+  public void recoverContainer(RMContainer rmContainer) {
+if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
+  return;
+}
+allocateContainer(null, rmContainer);
+  }
{code}
Since allocateContainer doesn't use the application-id parameter, I think it's 
better to remove it.

5) In TestWorkPreservingRMRestart.java
{code}
+assertEquals(usedCapacity, queue.getAbsoluteUsedCapacity(), 0);
{code}
It may be better to use the two-parameter assertEquals, since the delta is 0.

> Common work to re-populate containers’ state into scheduler
> ---
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Jian He
> Attachments: YARN-1368.1.patch, YARN-1368.2.patch, 
> YARN-1368.combined.001.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994162#comment-13994162
 ] 

Hadoop QA commented on YARN-2016:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644248/YARN-2016.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3730//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3730//console

This message is automatically generated.

> Yarn getApplicationRequest start time range is not honored
> --
>
> Key: YARN-2016
> URL: https://issues.apache.org/jira/browse/YARN-2016
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Venkat Ranganathan
>Assignee: Junping Du
> Attachments: YARN-2016.patch, YarnTest.java
>
>
> When we query for the previous applications by creating an instance of 
> GetApplicationsRequest and setting the start time range and application tag, 
> we see that the start range provided is not honored and all applications with 
> the tag are returned
> Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994014#comment-13994014
 ] 

Karthik Kambatla commented on YARN-1861:


I am obviously a +1 because I wrote the patch. Can someone other than Xuan and 
me take a look? 

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996780#comment-13996780
 ] 

Karthik Kambatla commented on YARN-2016:


Sorry for missing those merge-backs. A simple unit test like the one here would 
have caught the mistake.

> Yarn getApplicationRequest start time range is not honored
> --
>
> Key: YARN-2016
> URL: https://issues.apache.org/jira/browse/YARN-2016
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Venkat Ranganathan
>Assignee: Junping Du
> Fix For: 2.4.1
>
> Attachments: YARN-2016.patch, YarnTest.java
>
>
> When we query for the previous applications by creating an instance of 
> GetApplicationsRequest and setting the start time range and application tag, 
> we see that the start range provided is not honored and all applications with 
> the tag are returned
> Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2014-05-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996494#comment-13996494
 ] 

Jason Lowe commented on YARN-2014:
--

I did a bit of investigation on this, and the problem appears to be around the 
duration of the tasks.  In 2.4 the sleep job tasks are taking about 660 msec 
longer to execute than they do in 0.23.  I didn't nail down exactly where this 
extra delay was coming from, but I did notice that the tasks in 2.4 are loading 
over 800 more classes than they do in 0.23.  I think most of these are coming 
from the service loader for FileSystem schemes, as the 2.4 tasks load every 
FileSystem available and 0.23 does not.  In 0.23 FileSystem schemes are 
declared in configs, but in 2.4 they are dynamically detected and loaded via a 
service loader.

The ~0.5s delay in the task appears to be a fixed startup cost and is amplified 
by the AM scalability test since it runs very short tasks (the main portion of 
the map task lasts 1 second) and multiple tasks are run per map "slot" on the 
cluster, serializing the task startup delays.
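A small sketch of how a ServiceLoader-based lookup pulls in every FileSystem 
implementation on the classpath (illustrative of the mechanism, not the exact 
Hadoop code path):

{code}
import java.util.ServiceLoader;
import org.apache.hadoop.fs.FileSystem;

public class ListFileSystemImpls {
  public static void main(String[] args) {
    // Each iteration instantiates (and therefore class-loads) another FileSystem
    // implementation registered under META-INF/services, which is the kind of
    // per-JVM startup cost described above.
    for (FileSystem fs : ServiceLoader.load(FileSystem.class)) {
      System.out.println(fs.getClass().getName());
    }
  }
}
{code}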

> Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
> 
>
> Key: YARN-2014
> URL: https://issues.apache.org/jira/browse/YARN-2014
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: patrick white
>
> Performance comparison benchmarks from 2.x against 0.23 shows AM scalability 
> benchmark's runtime is approximately 10% slower in 2.4.0. The trend is 
> consistent across later releases in both lines, latest release numbers are:
> 2.4.0.0 runtime 255.6 seconds (avg 5 passes)
> 0.23.9.12 runtime 230.4 seconds (avg 5 passes)
> Diff: -9.9% 
> AM Scalability test is essentially a sleep job that measures time to launch 
> and complete a large number of mappers.
> The diff is consistent and has been reproduced in both a larger (350 node, 
> 100,000 mappers) perf environment, as well as a small (10 node, 2,900 
> mappers) demo cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext

2014-05-14 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated YARN-2050:
--

Attachment: YARN-2050-2.patch

Thanks, Jason. You are right. remoteAppLogDir could point to a different type 
of file system. Here is the updated patch.
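A minimal sketch of binding the FileContext to the remote log directory's own 
file system instead of the default one (an assumption about the shape of the fix, 
not the patch itself):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RemoteLogFileContext {
  public static FileContext forRemoteLogDir(Configuration conf) throws Exception {
    Path remoteAppLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    // Qualify against the remote dir's URI so e.g. an hdfs:// or viewfs://
    // log dir resolves to the right file system, not the local default.
    return FileContext.getFileContext(remoteAppLogDir.toUri(), conf);
  }
}
{code}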

> Fix LogCLIHelpers to create the correct FileContext
> ---
>
> Key: YARN-2050
> URL: https://issues.apache.org/jira/browse/YARN-2050
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: YARN-2050-2.patch, YARN-2050.patch
>
>
> LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus 
> the FileContext created isn't necessarily the FileContext for remote log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts

2014-05-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997621#comment-13997621
 ] 

Wangda Tan commented on YARN-2053:
--

I took a look at the related code; I think this problem is caused by the following.

In ApplicationMasterService.registerApplicationMaster(), nmTokens from the 
previous attempt's containers are added via a loop:
{code}
      List<Container> transferredContainers =
          ((AbstractYarnScheduler) rScheduler)
            .getTransferredContainers(applicationAttemptId);
      if (!transferredContainers.isEmpty()) {
        response.setContainersFromPreviousAttempts(transferredContainers);
        List<NMToken> nmTokens = new ArrayList<NMToken>();
        for (Container container : transferredContainers) {
          try {
            nmTokens.add(rmContext.getNMTokenSecretManager()
                .createAndGetNMToken(app.getUser(), applicationAttemptId,
                    container));
          }
{code}

And in NMTokenSecretManager.createAndGetNMToken():
{code}
  NMToken nmToken = null;
  if (nodeSet != null) {
if (!nodeSet.contains(container.getNodeId())) {
   ...
   // set nmToken
   ...
}
  }
  return nmToken;
{code}

So if multiple containers come from the same NM (with the same NodeId), a null 
nmToken will be added to the NMToken list. Then, in 
RegisterApplicationMasterResponsePBImpl.getTokenProtoIterable, it tries to 
convert the null NMToken to proto:
{code}
  @Override
  public NMTokenProto next() {
return convertToProtoFormat(iter.next());
  }
{code}

I think this is the root cause of the problem; I've uploaded a patch.
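A hedged sketch of the kind of guard that would avoid the NPE, reusing the 
variables from the snippet above (illustrative; the attached patch may fix it 
differently):

{code}
NMToken token = rmContext.getNMTokenSecretManager()
    .createAndGetNMToken(app.getUser(), applicationAttemptId, container);
// createAndGetNMToken returns null when this NM's NodeId is already in the
// attempt's node set, so only add non-null tokens to the response list.
if (token != null) {
  nmTokens.add(token);
}
{code}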

> Slider AM fails to restart: NPE in 
> RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
> 
>
> Key: YARN-2053
> URL: https://issues.apache.org/jira/browse/YARN-2053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sumit Mohanty
> Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, 
> yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak
>
>
> Slider AppMaster restart fails with the following:
> {code}
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup

2014-05-14 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated YARN-2036:
-

Attachment: YARN2036-02.patch

First revision based on comments.

> Document yarn.resourcemanager.hostname in ClusterSetup
> --
>
> Key: YARN-2036
> URL: https://issues.apache.org/jira/browse/YARN-2036
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Ray Chiang
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: YARN2036-01.patch, YARN2036-02.patch
>
>
> ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people 
> should just be able to use that directly.
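For reference, the single-property setup the description alludes to, as a 
yarn-site.xml sketch (the hostname is a placeholder):

{code}
<property>
  <!-- Setting only the hostname lets the RM scheduler, admin, webapp, etc.
       addresses default to this host with their standard ports. -->
  <name>yarn.resourcemanager.hostname</name>
  <value>rm.example.com</value>
</property>
{code}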



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-14 Thread Venkat Ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997187#comment-13997187
 ] 

Venkat Ranganathan commented on YARN-2016:
--

[~djp]  It would be good to have a unit test, as I mentioned before. The test 
case I uploaded was specific to one issue, but tests covering both directions of 
the wire transfers and the like would also be good. Maybe that is something I 
will consider adding.

> Yarn getApplicationRequest start time range is not honored
> --
>
> Key: YARN-2016
> URL: https://issues.apache.org/jira/browse/YARN-2016
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Venkat Ranganathan
>Assignee: Junping Du
> Fix For: 2.4.1
>
> Attachments: YARN-2016.patch, YarnTest.java
>
>
> When we query for the previous applications by creating an instance of 
> GetApplicationsRequest and setting the start time range and application tag, 
> we see that the start range provided is not honored and all applications with 
> the tag are returned
> Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-667) Data persisted in RM should be versioned

2014-05-14 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996610#comment-13996610
 ] 

Zhijie Shen commented on YARN-667:
--

bq. do you have sense on how possibility it could happen in 2.X as you are 
currently work on ATS?

My understanding is that we may have two possible changes in the future:

1. One is the data itself. Say an application gains one more state in the 
future; the new RM will try to persist it into the history store, while the old 
history server or client may not understand it. This should be a common problem 
with RMStateStore. This change is driven by the RM itself.

2. The other is a change to the timeline server internals. Say in the near 
future we modify the file structure in FileSystemApplicationHistoryStore to 
improve performance; the new FileSystemApplicationHistoryStore may no longer 
understand the existing data structure written by the old one. However, I think 
this part should be taken care of by the timeline server itself.
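A hedged illustration of the major/minor compatibility convention this kind of 
versioning usually implies (names here are hypothetical, not the committed YARN 
API):

{code}
/** Hypothetical helper, not the committed YARN API. */
final class StateVersionCheck {
  static final int CURRENT_MAJOR = 1;
  static final int CURRENT_MINOR = 1;

  // Minor bumps are additive (e.g. one more persisted app state) and remain
  // readable; a different major version means the layout changed incompatibly.
  static boolean isCompatible(int storedMajor, int storedMinor) {
    return storedMajor == CURRENT_MAJOR;
  }
}
{code}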

> Data persisted in RM should be versioned
> 
>
> Key: YARN-667
> URL: https://issues.apache.org/jira/browse/YARN-667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.0.4-alpha
>Reporter: Siddharth Seth
>Assignee: Junping Du
>
> Includes data persisted for RM restart, NodeManager directory structure and 
> the Aggregated Log Format.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997792#comment-13997792
 ] 

Tsuyoshi OZAWA commented on YARN-1474:
--

Could someone help start the Jenkins job? It doesn't seem to be running.

> Make schedulers services
> 
>
> Key: YARN-1474
> URL: https://issues.apache.org/jira/browse/YARN-1474
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sandy Ryza
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
> YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
> YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, 
> YARN-1474.8.patch, YARN-1474.9.patch
>
>
> Schedulers currently have a reinitialize but no start and stop.  Fitting them 
> into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-14 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996577#comment-13996577
 ] 

Zhijie Shen commented on YARN-2048:
---

bq.  the only implementation of ApplcationContext is 
ApplicationHistoryManagerImpl, which retrieves containers information from 
history store. 

It's a historical artifact of how the history service was developed. We plan to 
have a common interface for both the RM and the history server to retrieve the 
app information.

bq. How do you fetch the containers info from a historyserver and display it on 
the RM web?

We are supposed to provide consistent RM and history web UIs: the RM web UI 
shows the running and cached completed apps, while the history web UI shows the 
completed apps.

bq. If the information is from history store, seems RM won't get that kind of 
info until the application is done? Sometimes user's application might be a 
long-live application, never finish unless user kill it.

Not exactly. The RM web UI will always show running app information. Currently, 
the history web UI only shows the app information once the app is completed, due 
to the current history store implementation. After we rebase the store on top of 
the timeline store, we should be rid of the issue.

bq. Seems the only way providing containers info to RM is to maintain a list in 
RMAppAttempImpl, which was my way as well.

Months ago, I defined ApplicationContext as a common interface for retrieving the 
app information, including containers. Recently, released in 2.4.0, we added 
analogous RPC interfaces in both the RM and the history server to retrieve app 
information, again including getContainer(s). This is the reason why, in 
YARN-1809, I'd like to rebase both the RM and history web UIs to retrieve the app 
information from these RPC interfaces. In this way, both the RM and history web 
UIs show consistent app information via both the CLI and web pages. Of course, 
we'd like to make the REST APIs uniform as well.
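For what it's worth, a client-side sketch of listing an attempt's containers 
through the 2.4.0 RPC interface mentioned above (assumes a running cluster; the 
attempt id is passed as appattempt_... on the command line):

{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ContainerReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class ListAttemptContainers {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      ApplicationAttemptId attemptId =
          ConverterUtils.toApplicationAttemptId(args[0]);
      // getContainers() answers from the RM for running apps and from the
      // history server for completed ones.
      List<ContainerReport> reports = client.getContainers(attemptId);
      for (ContainerReport report : reports) {
        System.out.println(report.getContainerId());
      }
    } finally {
      client.stop();
    }
  }
}
{code}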

> List all of the containers of an application from the yarn web
> --
>
> Key: YARN-2048
> URL: https://issues.apache.org/jira/browse/YARN-2048
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, webapp
>Affects Versions: 2.3.0, 2.4.0, 2.5.0
>Reporter: Min Zhou
> Attachments: YARN-2048-trunk-v1.patch
>
>
> Currently, YARN doesn't provide a way to list all of the containers of an 
> application from its web UI. This kind of information is needed by the 
> application user: they can conveniently see how many containers their 
> applications have already acquired as well as which nodes those containers 
> were launched on. They also want to view the logs of each container of an 
> application.
> One approach is to maintain a container list in RMAppImpl and expose this 
> info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-14 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997722#comment-13997722
 ] 

Bikas Saha commented on YARN-1366:
--

Is there any value in combining the re-register and re-sending of pending 
requests into one new "resync" method? I am not arguing in favor of it, but it 
would help if we evaluate the pros/cons and go through the mental exercise of 
how things would work on the AM and RM side. This is important because we are 
making API changes and these are hard to undo.
e.g. a pro of the new resync method: the API clearly specifies that pending 
requests must be re-submitted. Are there any other advantages on the RM side to 
having this information come together in one "atomic" operation? Does it help 
the RM differentiate between an AM that was launched and had registered vs. an 
AM that had been launched but where the RM died before the AM could register? Is 
that important in any case?

> ApplicationMasterService should Resync with the AM upon allocate call after 
> restart
> ---
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
> YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0 and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM then 
> things should proceed like normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992698#comment-13992698
 ] 

Hadoop QA commented on YARN-1474:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643920/YARN-1474.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3717//console

This message is automatically generated.

> Make schedulers services
> 
>
> Key: YARN-1474
> URL: https://issues.apache.org/jira/browse/YARN-1474
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.3.0
>Reporter: Sandy Ryza
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
> YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
> YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
> YARN-1474.9.patch
>
>
> Schedulers currently have a reinitialize but no start and stop.  Fitting them 
> into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests

2014-05-14 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997708#comment-13997708
 ] 

Chris Riccomini commented on YARN-2027:
---

relaxLocality was set to false.

{noformat}
(0 until containers).foreach(idx => amClient.addContainerRequest(new 
ContainerRequest(capability, getHosts, List("/default-rack").toArray[String], 
priority, false)))
{noformat}

The last false in that parameter list is relaxLocality.
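For readers following along, the Java equivalent of that Scala call as a sketch 
(the host name is a placeholder):

{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class HostSpecificRequest {
  public static void addRequest(AMRMClient<ContainerRequest> amClient) {
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    // Last argument is relaxLocality=false: only the named host should satisfy
    // this request, which is the behavior the report says is not being honored.
    amClient.addContainerRequest(new ContainerRequest(capability,
        new String[] {"node1.example.com"},
        new String[] {"/default-rack"},
        priority, false));
  }
}
{code}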

> YARN ignores host-specific resource requests
> 
>
> Key: YARN-2027
> URL: https://issues.apache.org/jira/browse/YARN-2027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.4.0
> Environment: RHEL 6.1
> YARN 2.4
>Reporter: Chris Riccomini
>
> YARN appears to be ignoring host-level ContainerRequests.
> I am creating a container request with code that pretty closely mirrors the 
> DistributedShell code:
> {code}
>   protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) 
> {
> info("Requesting %d container(s) with %dmb of memory" format (containers, 
> memMb))
> val capability = Records.newRecord(classOf[Resource])
> val priority = Records.newRecord(classOf[Priority])
> priority.setPriority(0)
> capability.setMemory(memMb)
> capability.setVirtualCores(cpuCores)
> // Specifying a host in the String[] host parameter here seems to do 
> nothing. Setting relaxLocality to false also doesn't help.
> (0 until containers).foreach(idx => amClient.addContainerRequest(new 
> ContainerRequest(capability, null, null, priority)))
>   }
> {code}
> When I run this code with a specific host in the ContainerRequest, YARN does 
> not honor the request. Instead, it puts the container on an arbitrary host. 
> This appears to be true for both the FifoScheduler and the CapacityScheduler.
> Currently, we are running the CapacityScheduler with the following settings:
> {noformat}
> <configuration>
>   <property>
>     <name>yarn.scheduler.capacity.maximum-applications</name>
>     <value>1</value>
>     <description>
>       Maximum number of applications that can be pending and running.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>     <value>0.1</value>
>     <description>
>       Maximum percent of resources in the cluster which can be used to run
>       application masters i.e. controls number of concurrent running
>       applications.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.resource-calculator</name>
>     <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
>     <description>
>       The ResourceCalculator implementation to be used to compare
>       Resources in the scheduler.
>       The default i.e. DefaultResourceCalculator only uses Memory while
>       DominantResourceCalculator uses dominant-resource to compare
>       multi-dimensional resources such as Memory, CPU etc.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.queues</name>
>     <value>default</value>
>     <description>
>       The queues at the this level (root is the root queue).
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>     <value>100</value>
>     <description>Samza queue target capacity.</description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
>     <value>1</value>
>     <description>
>       Default queue user limit a percentage from 0.0 to 1.0.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>     <value>100</value>
>     <description>
>       The maximum capacity of the default queue.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.state</name>
>     <value>RUNNING</value>
>     <description>
>       The state of the default queue. State can be one of RUNNING or STOPPED.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
>     <value>*</value>
>     <description>
>       The ACL of who can submit jobs to the default queue.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
>     <value>*</value>
>     <description>
>       The ACL of who can administer jobs on the default queue.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.node-locality-delay</name>
>     <value>40</value>
>     <description>
>       Number of missed scheduling opportunities after which the CapacityScheduler
>       attempts to schedule rack-local containers.
>       Typically this should be set to number of nodes in the cluster, By default
>       is setting approximately number of nodes in one rack which is 40.
>     </description>
>   </property>
> </configuration>
> {noformat}
> Digging into the code a bit (props to [~jghoman] for finding this), we have a 
> theory as to why this is happening. It looks like 
> RMContainerRequestor.addContainerReq adds three resource requests per 
> container request: data-local, rack-local, and any:
> {code}
> protected void addContainerReq(ContainerRequest req) {
>   // Create resource requests
>   for (String host : req.hosts) {
> // Data-local
> if (!isNodeBlacklisted(host)) {
>   addResourceRequest(req.priority, host, req.capability);
> }  

[jira] [Commented] (YARN-2011) Typo in TestLeafQueue

2014-05-14 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993395#comment-13993395
 ] 

Junping Du commented on YARN-2011:
--

Nice catch, [~airbots]! There is also a warning that the following code in 
testAppAttemptMetrics() is never used:
{code}
FiCaSchedulerApp app_0 = new FiCaSchedulerApp(appAttemptId_0, user_0, a, 
null,
rmContext);
{code}
Since you are already there, would you like to fix this warning in your patch 
as well? I will review and commit it. Thanks!

> Typo in TestLeafQueue
> -
>
> Key: YARN-2011
> URL: https://issues.apache.org/jira/browse/YARN-2011
> Project: Hadoop YARN
>  Issue Type: Test
>Affects Versions: 2.4.0
>Reporter: Chen He
>Assignee: Chen He
>Priority: Trivial
> Attachments: YARN-2011.patch
>
>
> a.assignContainers(clusterResource, node_0);
> assertEquals(2*GB, a.getUsedResources().getMemory());
> assertEquals(2*GB, app_0.getCurrentConsumption().getMemory());
> assertEquals(0*GB, app_1.getCurrentConsumption().getMemory());
> assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G
> assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G
> // Again one to user_0 since he hasn't exceeded user limit yet
> a.assignContainers(clusterResource, node_0);
> assertEquals(3*GB, a.getUsedResources().getMemory());
> assertEquals(2*GB, app_0.getCurrentConsumption().getMemory());
> assertEquals(1*GB, app_1.getCurrentConsumption().getMemory());
> assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G
> assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2057) NPE in RM handling node update while app submission in progress

2014-05-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997635#comment-13997635
 ] 

Wangda Tan commented on YARN-2057:
--

[~ste...@apache.org], I think this is the same issue that was already resolved by 
YARN-1986. You can take a look and close it if you agree.

> NPE in RM handling node update while app submission in progress
> ---
>
> Key: YARN-2057
> URL: https://issues.apache.org/jira/browse/YARN-2057
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test 
> TestDestroyMasterlessAM
>Reporter: Steve Loughran
>
> One of our test runs finished prematurely with an NPE in the RM, followed by 
> the RM thread calling system.exit(). It looks like an NM update came in while 
> the app was still being set up, causing confusion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1938) Kerberos authentication for the timeline server

2014-05-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995745#comment-13995745
 ] 

Hadoop QA commented on YARN-1938:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644488/YARN-1938.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3735//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3735//console

This message is automatically generated.

> Kerberos authentication for the timeline server
> ---
>
> Key: YARN-1938
> URL: https://issues.apache.org/jira/browse/YARN-1938
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1938.1.patch, YARN-1938.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval

2014-05-14 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-2054:
--

 Summary: Poor defaults for YARN ZK configs for retries and 
retry-interval
 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla


Currently, we have the following default values:
# yarn.resourcemanager.zk-num-retries - 500
# yarn.resourcemanager.zk-retry-interval-ms - 2000

This leads to a cumulative 1000 seconds before the RM gives up trying to connect 
to the ZK. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2057) NPE in RM handling node update while app submission in progress

2014-05-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997557#comment-13997557
 ] 

Steve Loughran commented on YARN-2057:
--

Stack trace below. This is a transient failure (it didn't reoccur immediately 
afterwards), so the log is all there is, I'm afraid.

Can I note that handling a failed status update by triggering RM exit is a bit 
brittle? It exposes the RM to failover if a single NM starts sending in bad 
data.

{code}
2014-05-14 14:48:32,248 [JUnit] DEBUG launch.AbstractLauncher 
(AbstractLauncher.java:completeContainerLaunch(162)) - Completed setting up 
container command $JAVA_HOME/bin/java -Djava.net.preferIPv4Stack=true 
-Djava.awt.headless=true -Xmx256M -ea -esa 
org.apache.slider.server.appmaster.SliderAppMaster create 
test_destroy_masterless_am --debug -cluster-uri 
file:/Users/stevel/.slider/cluster/test_destroy_masterless_am --rm 
192.168.1.86:54470 --fs file:/// -D slider.registry.path=/registry -D 
slider.zookeeper.quorum=localhost:1 1>/slider-out.txt 
2>/slider-err.txt 
2014-05-14 14:48:32,249 [JUnit] INFO  launch.AppMasterLauncher 
(AppMasterLauncher.java:submitApplication(207)) - Submitting application to 
Resource Manager
2014-05-14 14:48:32,281 [IPC Server handler 2 on 54471] INFO  
resourcemanager.ClientRMService (ClientRMService.java:submitApplication(537)) - 
Application with id 1 submitted by user stevel
2014-05-14 14:48:32,281 [AsyncDispatcher event handler] INFO  rmapp.RMAppImpl 
(RMAppImpl.java:transition(863)) - Storing application with id 
application_1400075308869_0001
2014-05-14 14:48:32,282 [IPC Server handler 2 on 54471] INFO  
resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(142)) - 
USER=stevel  IP=192.168.1.86 OPERATION=Submit Application Request
TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1400075308869_0001
2014-05-14 14:48:32,283 [AsyncDispatcher event handler] INFO  rmapp.RMAppImpl 
(RMAppImpl.java:handle(639)) - application_1400075308869_0001 State change from 
NEW to NEW_SAVING
2014-05-14 14:48:32,287 [AsyncDispatcher event handler] INFO  
recovery.RMStateStore (RMStateStore.java:handleStoreEvent(620)) - Storing info 
for app: application_1400075308869_0001
2014-05-14 14:48:32,295 [AsyncDispatcher event handler] INFO  rmapp.RMAppImpl 
(RMAppImpl.java:handle(639)) - application_1400075308869_0001 State change from 
NEW_SAVING to SUBMITTED
2014-05-14 14:48:32,296 [ResourceManager Event Processor] INFO  
fifo.FifoScheduler (FifoScheduler.java:addApplication(369)) - Accepted 
application application_1400075308869_0001 from user: stevel, currently num of 
applications: 1
2014-05-14 14:48:32,297 [ResourceManager Event Processor] FATAL 
resourcemanager.ResourceManager (ResourceManager.java:run(600)) - Error in 
handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591)
at java.lang.Thread.run(Thread.java:745)
2014-05-14 14:48:32,298 [ResourceManager Event Processor] INFO  
resourcemanager.ResourceManager (ResourceManager.java:run(604)) - Exiting, 
bbye..
2014-05-14 14:48:32,298 [AsyncDispatcher event handler] INFO  rmapp.RMAppImpl 
(RMAppImpl.java:handle(639)) - application_1400075308869_0001 State change from 
SUBMITTED to ACCEPTED
2014-05-14 14:48:32,299 [AsyncDispatcher event handler] INFO  
resourcemanager.ApplicationMasterService 
(ApplicationMasterService.java:registerAppAttempt(608)) - Registering app 
attempt : appattempt_1400075308869_0001_01

{code}
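
A hedged sketch of the more defensive handling suggested above (this is not the actual 
ResourceManager code; stopped, eventQueue, scheduler, and LOG stand for hypothetical 
fields of the dispatcher):

{code}
// Log and drop a failing scheduler event instead of exiting the RM.
while (!stopped) {
  try {
    SchedulerEvent event = eventQueue.take();
    scheduler.handle(event);   // may throw, e.g. the NPE in the trace above
  } catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
    break;
  } catch (Exception e) {
    LOG.error("Error handling scheduler event, dropping it and continuing", e);
  }
}
{code}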

> NPE in RM handling node update while app submission in progress
> ---
>
> Key: YARN-2057
> URL: https://issues.apache.org/jira/browse/YARN-2057
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test 
> TestDestroyMasterlessAM
>Reporter: Steve Loughran
>
> One of our test runs finished prematurely with an NPE in the RM, followed by 
> the RM thread calling system.exit(). It looks like an NM update came in while 
> the app was still being set up, causing confusion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-14 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997114#comment-13997114
 ] 

Junping Du commented on YARN-2016:
--

bq. Sorry for missing those merge-backs. A simple unit test like here wouldn't 
have let the mistake happen.
No worries. We all make mistakes. :) I am proposing to add a simple unit test 
like this for any PBImpl changes in the future. [~kasha] and everyone on the watch list, 
thoughts?
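
A minimal sketch of the kind of round-trip test meant here, assuming the 2.4 
GetApplicationsRequest/PBImpl API (the test name and imports are hypothetical):

{code}
@Test
public void testStartRangeSurvivesProtoRoundTrip() {
  GetApplicationsRequestPBImpl request = new GetApplicationsRequestPBImpl();
  request.setStartRange(1000L, 2000L);
  // Convert to proto and back; a field missed in mergeLocalToBuilder() would be lost here.
  GetApplicationsRequestPBImpl copy =
      new GetApplicationsRequestPBImpl(request.getProto());
  Assert.assertEquals(1000L, copy.getStartRange().getMinimumLong());
  Assert.assertEquals(2000L, copy.getStartRange().getMaximumLong());
}
{code}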

> Yarn getApplicationRequest start time range is not honored
> --
>
> Key: YARN-2016
> URL: https://issues.apache.org/jira/browse/YARN-2016
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Venkat Ranganathan
>Assignee: Junping Du
> Fix For: 2.4.1
>
> Attachments: YARN-2016.patch, YarnTest.java
>
>
> When we query for the previous applications by creating an instance of 
> GetApplicationsRequest and setting the start time range and application tag, 
> we see that the start range provided is not honored and all applications with 
> the tag are returned
> Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2057) NPE in RM handling node update while app submission in progress

2014-05-14 Thread Steve Loughran (JIRA)
Steve Loughran created YARN-2057:


 Summary: NPE in RM handling node update while app submission in 
progress
 Key: YARN-2057
 URL: https://issues.apache.org/jira/browse/YARN-2057
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: OS/X, hadoop 2.4.0 mini yarn cluster, slider unit test 
TestDestroyMasterlessAM
Reporter: Steve Loughran


One of our test runs finished prematurely with an NPE in the RM, followed by 
the RM thread calling system.exit(). It looks like an NM update came in while 
the app was still being set up, causing confusion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997266#comment-13997266
 ] 

Gera Shegalov commented on YARN-1515:
-

OK, I can work on CMP.signalContainer and replace stopContainers with 
signalContainer.

> Ability to dump the container threads and stop the containers in a single RPC
> -
>
> Key: YARN-1515
> URL: https://issues.apache.org/jira/browse/YARN-1515
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, nodemanager
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
> YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
> YARN-1515.v06.patch, YARN-1515.v07.patch
>
>
> This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
> timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-570) Time strings are formated in different timezone

2014-05-14 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-570:
---

Component/s: webapp

> Time strings are formated in different timezone
> ---
>
> Key: YARN-570
> URL: https://issues.apache.org/jira/browse/YARN-570
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.2.0
>Reporter: Peng Zhang
>Assignee: Akira AJISAKA
> Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch
>
>
> Time strings on different page are displayed in different timezone.
> If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as 
> "Wed, 10 Apr 2013 08:29:56 GMT"
> If it is formatted by format() in yarn.util.Times, it appears as "10-Apr-2013 
> 16:29:56"
> Same value, but different timezone.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2053) Slider AM fails to restart

2014-05-14 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2053:
-

Description: 
Slider AppMaster restart fails with the following:
{code}
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
{code}

  was:
Slider AppMaster restart fails with the following:

{noformat}
14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 
48058,address tracking URL=http://c6403.ambari.apache.org:48705
14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

Exception: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
at 
org.apache.hadoop.yarn.proto.Ap

[jira] [Commented] (YARN-2053) Slider AM fails to restart

2014-05-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997459#comment-13997459
 ] 

Steve Loughran commented on YARN-2053:
--

Note that this AM requests container retention across AM restarts, so it is testing 
code paths that not much (anything?) else is testing.
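
For context, a hedged illustration (not Slider's actual code) of the submission-context 
flag that requests container retention across AM attempts, assuming the 2.4 
ApplicationSubmissionContext API:

{code}
ApplicationSubmissionContext appContext =
    Records.newRecord(ApplicationSubmissionContext.class);
// Ask the RM to keep running containers when a new AM attempt is launched.
appContext.setKeepContainersAcrossApplicationAttempts(true);
{code}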

> Slider AM fails to restart
> --
>
> Key: YARN-2053
> URL: https://issues.apache.org/jira/browse/YARN-2053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sumit Mohanty
> Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, 
> yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak
>
>
> Slider AppMaster restart fails with the following:
> {noformat}
> 14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 
> 48058,address tracking URL=http://c6403.ambari.apache.org:48705
> 14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> Exception: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
> at 
> o

[jira] [Updated] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-14 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-1366:
-

Attachment: YARN-1366.2.patch

Synced up offline with Anubhav about the doubts mentioned in the previous comment.
I made changes in MapReduce as well as in AMRMClientImpl to:
  1. reset the responseId to 0
  2. re-register with the RM
  3. add back all pending requests and update blacklisted nodes.
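
The sketch below makes these three steps concrete (hypothetical class, field, and method 
names; this is not the attached patch):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

class ResyncSketch {
  private int lastResponseId;
  private final List<ResourceRequest> pendingAsks = new ArrayList<ResourceRequest>();
  private final List<String> blacklistedNodes = new ArrayList<String>();

  void onResyncFromRM() {
    lastResponseId = 0;                                 // 1. reset responseId to 0
    reRegisterWithRM();                                 // 2. re-register with the RM
    resendOutstanding(pendingAsks, blacklistedNodes);   // 3. add back pending requests and blacklist
  }

  private void reRegisterWithRM() { /* placeholder for registerApplicationMaster */ }
  private void resendOutstanding(List<ResourceRequest> asks,
                                 List<String> blacklist) { /* placeholder */ }
}
{code}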

> ApplicationMasterService should Resync with the AM upon allocate call after 
> restart
> ---
>
> Key: YARN-1366
> URL: https://issues.apache.org/jira/browse/YARN-1366
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Rohith
> Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
> YARN-1366.prototype.patch, YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> calling resyncing with the RM. Resync means resetting the allocate RPC 
> sequence number to 0 and the AM should send its entire outstanding request to 
> the RM. Note that if the AM is making its first allocate call to the RM then 
> things should proceed like normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997144#comment-13997144
 ] 

Jian He commented on YARN-1368:
---

The new patch addresses Wangda's comments and also implements the specific recover 
methods for the FifoScheduler queue. This patch should be rebased on top of 
YARN-2017.

> Common work to re-populate containers’ state into scheduler
> ---
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Jian He
> Attachments: YARN-1368.1.patch, YARN-1368.2.patch, 
> YARN-1368.combined.001.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2053) Slider AM fails to restart

2014-05-14 Thread Sumit Mohanty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Mohanty updated YARN-2053:


Attachment: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak
yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak

> Slider AM fails to restart
> --
>
> Key: YARN-2053
> URL: https://issues.apache.org/jira/browse/YARN-2053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sumit Mohanty
> Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, 
> yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak
>
>
> Slider AppMaster restart fails with the following:
> {noformat}
> 14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 
> 48058,address tracking URL=http://c6403.ambari.apache.org:48705
> 14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> Exception: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
> at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.p

[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-14 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1368:
--

Attachment: YARN-1368.2.patch

> Common work to re-populate containers’ state into scheduler
> ---
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Jian He
> Attachments: YARN-1368.1.patch, YARN-1368.2.patch, 
> YARN-1368.combined.001.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts

2014-05-14 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2053:
-

Summary: Slider AM fails to restart: NPE in 
RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
  (was: Slider AM fails to restart: NPE in )

> Slider AM fails to restart: NPE in 
> RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
> 
>
> Key: YARN-2053
> URL: https://issues.apache.org/jira/browse/YARN-2053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sumit Mohanty
> Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, 
> yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak
>
>
> Slider AppMaster restart fails with the following:
> {code}
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests

2014-05-14 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992900#comment-13992900
 ] 

Chris Riccomini commented on YARN-2027:
---

Dug into this a bit more. I'm not entirely convinced that the TreeSet stuff is 
actually an issue anymore. RMContainerRequestor.makeRemoteRequest calls:

{code}
  allocateResponse = scheduler.allocate(allocateRequest);
{code}

If you drill down through the capacity scheduler, into 
SchedulerApplicationAttempt and AppSchedulingInfo, you'll eventually see that 
AppSchedulingInfo.updateResourceRequests simply adds the items in "ask" into a 
map based on priority. The order in which these asks come in seems to always be 
with ANY first (see above), so updatePendingResources will always be true, but 
this doesn't seem harmful.

Anyway, any ideas why YARN is ignoring host requests?
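
For comparison with the Scala snippet quoted in the description below, a hedged Java 
illustration of a node-specific request with relaxLocality=false (amClient is assumed to 
be an already-started AMRMClient<ContainerRequest>; the hostname is made up):

{code}
int memMb = 1024;
int cpuCores = 1;
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(memMb);
capability.setVirtualCores(cpuCores);
Priority priority = Records.newRecord(Priority.class);
priority.setPriority(0);
String[] nodes = new String[] { "preferred-host.example.com" };
amClient.addContainerRequest(
    new AMRMClient.ContainerRequest(capability, nodes, null /* racks */, priority,
        false /* relaxLocality */));
{code}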

> YARN ignores host-specific resource requests
> 
>
> Key: YARN-2027
> URL: https://issues.apache.org/jira/browse/YARN-2027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.4.0
> Environment: RHEL 6.1
> YARN 2.4
>Reporter: Chris Riccomini
>
> YARN appears to be ignoring host-level ContainerRequests.
> I am creating a container request with code that pretty closely mirrors the 
> DistributedShell code:
> {code}
>   protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) 
> {
> info("Requesting %d container(s) with %dmb of memory" format (containers, 
> memMb))
> val capability = Records.newRecord(classOf[Resource])
> val priority = Records.newRecord(classOf[Priority])
> priority.setPriority(0)
> capability.setMemory(memMb)
> capability.setVirtualCores(cpuCores)
> // Specifying a host in the String[] host parameter here seems to do 
> nothing. Setting relaxLocality to false also doesn't help.
> (0 until containers).foreach(idx => amClient.addContainerRequest(new 
> ContainerRequest(capability, null, null, priority)))
>   }
> {code}
> When I run this code with a specific host in the ContainerRequest, YARN does 
> not honor the request. Instead, it puts the container on an arbitrary host. 
> This appears to be true for both the FifoScheduler and the CapacityScheduler.
> Currently, we are running the CapacityScheduler with the following settings:
> {noformat}
> 
>   
> yarn.scheduler.capacity.maximum-applications
> 1
> 
>   Maximum number of applications that can be pending and running.
> 
>   
>   
> yarn.scheduler.capacity.maximum-am-resource-percent
> 0.1
> 
>   Maximum percent of resources in the cluster which can be used to run
>   application masters i.e. controls number of concurrent running
>   applications.
> 
>   
>   
> yarn.scheduler.capacity.resource-calculator
> 
> org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
> 
>   The ResourceCalculator implementation to be used to compare
>   Resources in the scheduler.
>   The default i.e. DefaultResourceCalculator only uses Memory while
>   DominantResourceCalculator uses dominant-resource to compare
>   multi-dimensional resources such as Memory, CPU etc.
> 
>   
>   
> yarn.scheduler.capacity.root.queues
> default
> 
>   The queues at the this level (root is the root queue).
> 
>   
>   
> yarn.scheduler.capacity.root.default.capacity
> 100
> Samza queue target capacity.
>   
>   
> yarn.scheduler.capacity.root.default.user-limit-factor
> 1
> 
>   Default queue user limit a percentage from 0.0 to 1.0.
> 
>   
>   
> yarn.scheduler.capacity.root.default.maximum-capacity
> 100
> 
>   The maximum capacity of the default queue.
> 
>   
>   
> yarn.scheduler.capacity.root.default.state
> RUNNING
> 
>   The state of the default queue. State can be one of RUNNING or STOPPED.
> 
>   
>   
> yarn.scheduler.capacity.root.default.acl_submit_applications
> *
> 
>   The ACL of who can submit jobs to the default queue.
> 
>   
>   
> yarn.scheduler.capacity.root.default.acl_administer_queue
> *
> 
>   The ACL of who can administer jobs on the default queue.
> 
>   
>   
> yarn.scheduler.capacity.node-locality-delay
> 40
> 
>   Number of missed scheduling opportunities after which the 
> CapacityScheduler
>   attempts to schedule rack-local containers.
>   Typically this should be set to number of nodes in the cluster, By 
> default is setting
>   approximately number of nodes in one rack which is 40.
> 
>   
> 
> {noformat}
> Digging into the code a bit (props to [~jghoman] for finding this), we have a 
> theory as to why this is happening. It looks lik

[jira] [Updated] (YARN-2056) Disable preemption at Queue level

2014-05-14 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2056:


Description: We need to be able to disable preemption at individual queue 
level  (was: If Queue A does not have enough capacity to run AM, then AM will 
borrow capacity from queue B to run AM in that case AM will be killed if queue 
B will reclaim its capacity and again AM will be launched and killed again, in 
that case job will be failed.)

> Disable preemption at Queue level
> -
>
> Key: YARN-2056
> URL: https://issues.apache.org/jira/browse/YARN-2056
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
> Fix For: 2.1.0-beta
>
>
> We need to be able to disable preemption at individual queue level



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1302) Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol

2014-05-14 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996627#comment-13996627
 ] 

Zhijie Shen commented on YARN-1302:
---

Anyway, let's leave it open to see whether the DT access needs to be exposed via 
ApplicationHistoryProtocol as well.

> Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol
> --
>
> Key: YARN-1302
> URL: https://issues.apache.org/jira/browse/YARN-1302
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> Like the ApplicationClientProtocol, ApplicationHistoryProtocol needs its own 
> security stack. We need to implement AHSDelegationTokenSecretManager, 
> AHSDelegationTokenIndentifier, AHSDelegationTokenSelector and other analogs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in

2014-05-14 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2053:
-

Summary: Slider AM fails to restart: NPE in   (was: Slider AM fails to 
restart)

> Slider AM fails to restart: NPE in 
> ---
>
> Key: YARN-2053
> URL: https://issues.apache.org/jira/browse/YARN-2053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sumit Mohanty
> Attachments: yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, 
> yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak
>
>
> Slider AppMaster restart fails with the following:
> {code}
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2053) Slider AM fails to restart

2014-05-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997461#comment-13997461
 ] 

Steve Loughran commented on YARN-2053:
--

{noformat}
14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 
48058,address tracking URL=http://c6403.ambari.apache.org:48705
14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

Exception: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at or

[jira] [Updated] (YARN-1049) ContainerExistStatus should define a status for preempted containers

2014-05-14 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1049:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-45

> ContainerExistStatus should define a status for preempted containers
> 
>
> Key: YARN-1049
> URL: https://issues.apache.org/jira/browse/YARN-1049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Affects Versions: 2.1.0-beta
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
>Priority: Blocker
> Fix For: 2.1.1-beta
>
> Attachments: YARN-1049.patch
>
>
> With the current behavior it is impossible to determine whether a container has been 
> preempted or lost due to an NM crash.
> Adding a PREEMPTED exit status (-102) will help an AM determine that a 
> container has been preempted.
> Note the change of scope from the original summary/description. The original 
> scope proposed API/behavior changes. Because we are past 2.1.0-beta I'm 
> reducing the scope of this JIRA.
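
For illustration only (not part of the attached patch): once the PREEMPTED exit status 
exists, an AM can check it when processing completed containers; allocateResponse is 
assumed to come from a normal allocate call.

{code}
for (ContainerStatus status : allocateResponse.getCompletedContainersStatuses()) {
  if (status.getExitStatus() == ContainerExitStatus.PREEMPTED) {
    // Treat as preemption: reschedule the work rather than counting an app failure.
  }
}
{code}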



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2014-05-14 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997328#comment-13997328
 ] 

Sunil G commented on YARN-2055:
---

Hi Mayank,
Is this issue the same as YARN-2022?

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> --
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
> Fix For: 2.1.0-beta
>
>
> If Queue A does not have enough capacity to run AM, then AM will borrow 
> capacity from queue B to run AM in that case AM will be killed if queue B 
> will reclaim its capacity and again AM will be launched and killed again, in 
> that case job will be failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins

2014-05-14 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1408:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-45

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
> timeout for 30mins
> --
>
> Key: YARN-1408
> URL: https://issues.apache.org/jira/browse/YARN-1408
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.2.0
>Reporter: Sunil G
>Assignee: Sunil G
> Fix For: 2.5.0
>
> Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
> Yarn-1408.4.patch, Yarn-1408.patch
>
>
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable= true ,
>  *  
> yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Assign a big jobA on queue a which uses full cluster capacity
> Step 2: Submitted a jobB to queue b  which would use less than 20% of cluster 
> capacity
> JobA task which uses queue b capcity is been preempted and killed.
> This caused below problem:
> 1. New Container has got allocated for jobA in Queue A as per node update 
> from an NM.
> 2. This container has been preempted immediately as per preemption.
> Here ACQUIRED at KILLED Invalid State exception came when the next AM 
> heartbeat reached RM.
> ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> ACQUIRED at KILLED
> This also caused the Task to go for a timeout for 30minutes as this Container 
> was already killed by preemption.
> attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2053) Slider AM fails to restart

2014-05-14 Thread Sumit Mohanty (JIRA)
Sumit Mohanty created YARN-2053:
---

 Summary: Slider AM fails to restart
 Key: YARN-2053
 URL: https://issues.apache.org/jira/browse/YARN-2053
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Sumit Mohanty


Slider AppMaster restart fails with the following:

{noformat}
14/05/10 17:02:17 INFO appmaster.SliderAppMaster: Connecting to RM at 
48058,address tracking URL=http://c6403.ambari.apache.org:48705
14/05/10 17:02:17 ERROR main.ServiceLauncher: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:95)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

Exception: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.convertToProtoFormat(RegisterApplicationMasterResponsePBImpl.java:384)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.access$100(RegisterApplicationMasterResponsePBImpl.java:53)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:355)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl$2$1.next(RegisterApplicationMasterResponsePBImpl.java:344)
at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToBuilder(RegisterApplicationMasterResponsePBImpl.java:123)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.mergeLocalToProto(RegisterApplicationMasterResponsePBImpl.java:104)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterResponsePBImpl.getProto(RegisterApplicationMasterResponsePBImpl.java:75)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.registerApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:91)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.c

[jira] [Commented] (YARN-1981) Nodemanager version is not updated when a node reconnects

2014-05-14 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996825#comment-13996825
 ] 

Jonathan Eagles commented on YARN-1981:
---

+1. lgtm. Committing to branch-2 and trunk. Thanks, [~jlowe].

> Nodemanager version is not updated when a node reconnects
> -
>
> Key: YARN-1981
> URL: https://issues.apache.org/jira/browse/YARN-1981
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1981.patch
>
>
> When a nodemanager is quickly restarted and happens to change versions during 
> the restart (e.g.: rolling upgrade scenario) the NM version as reported by 
> the RM is not updated.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1809) Synchronize RM and Generic History Service Web-UIs

2014-05-14 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1809:
--

Target Version/s: 2.5.0

> Synchronize RM and Generic History Service Web-UIs
> --
>
> Key: YARN-1809
> URL: https://issues.apache.org/jira/browse/YARN-1809
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, 
> YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch, YARN-1809.6.patch, 
> YARN-1809.7.patch, YARN-1809.8.patch, YARN-1809.9.patch
>
>
> After YARN-953, the web-UI of generic history service is provide more 
> information than that of RM, the details about app attempt and container. 
> It's good to provide similar web-UIs, but retrieve the data from separate 
> source, i.e., RM cache and history store respectively.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2014-05-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996965#comment-13996965
 ] 

Jason Lowe commented on YARN-2014:
--

HADOOP-7549 added service loading of filesystems, and HADOOP-7350 added service 
loading of compression codecs.  I'll see if I have some time to disable the 
service loading of unnecessary classes.

> Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
> 
>
> Key: YARN-2014
> URL: https://issues.apache.org/jira/browse/YARN-2014
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: patrick white
>Assignee: Jason Lowe
>
> Performance comparison benchmarks from 2.x against 0.23 shows AM scalability 
> benchmark's runtime is approximately 10% slower in 2.4.0. The trend is 
> consistent across later releases in both lines, latest release numbers are:
> 2.4.0.0 runtime 255.6 seconds (avg 5 passes)
> 0.23.9.12 runtime 230.4 seconds (avg 5 passes)
> Diff: -9.9% 
> AM Scalability test is essentially a sleep job that measures time to launch 
> and complete a large number of mappers.
> The diff is consistent and has been reproduced in both a larger (350 node, 
> 100,000 mappers) perf environment, as well as a small (10 node, 2,900 
> mappers) demo cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2014-05-14 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-2055:
--

Summary: Preemption: Jobs are failing due to AMs are getting launched and 
killed multiple times  (was: Preemtion: Jobs are failing due to AMs are getting 
launched and killed multiple times)

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> --
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
> Fix For: 2.1.0-beta
>
>
> If Queue A does not have enough capacity to run AM, then AM will borrow 
> capacity from queue B to run AM in that case AM will be killed if queue B 
> will reclaim its capacity and again AM will be launched and killed again, in 
> that case job will be failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-05-14 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997351#comment-13997351
 ] 

Sunil G commented on YARN-2022:
---

Thank you Carlo for the clarifications on am-priority and user-limit-factor.

I agree with your point about possible tampering with container priority 0. On this 
point, I feel your option 1 may be the better choice (track which container is the AM, 
rather than relying on Priority). Even with option 2, the AM container first has to be 
found among multiple containers at Priority=0; saving the AM first and then saving as 
many other containers as possible may not work well when many applications are marked 
for preemption.

When an AM container is launched, the RM has to have a way of marking it as an AM 
container. CapacityScheduler has the RMContext, and from that, with the 
ApplicationAttemptId, we can get the MasterContainer. This feels like a slightly complex 
look-up; it may be better to set a property directly on the container to mark it as the 
MasterContainer.

Also, with user-limit-factor and max-user-percentage, the scheduler keeps skipping 
containers, and having such an AM ask for containers again is not good. If this AM is a 
"saved AM" from preemption, it is even worse. For this too we can place a checkpoint 
decision on whether to save it or not.

So to summarize roughly:
1)  A better way of marking and finding the AM container is needed. [We can 
see whether this can be extended to also save multiple low-priority containers.]
2)  A checkpoint has to be derived from the factors below to decide whether to 
save an AM or not:
  a. the max-am-percentage limit has to be honored.
  b. user-limit-factor or max-user-percentage also has to 
be checked.

I can first try to post a design approach for deriving the checkpoint decision from 
both a. and b. above. Please share more thoughts if any on this.
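
A hedged sketch of the RMContext look-up mentioned above (illustrative only, not a 
proposed patch):

{code}
boolean isAmContainer(RMContext rmContext, RMContainer candidate) {
  ApplicationAttemptId attemptId = candidate.getApplicationAttemptId();
  RMApp app = rmContext.getRMApps().get(attemptId.getApplicationId());
  if (app == null || app.getCurrentAppAttempt() == null) {
    return false;
  }
  Container master = app.getCurrentAppAttempt().getMasterContainer();
  // Compare container ids; if they match, the candidate is the AM (master) container.
  return master != null && master.getId().equals(candidate.getContainerId());
}
{code}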

> Preempting an Application Master container can be kept as least priority when 
> multiple applications are marked for preemption by 
> ProportionalCapacityPreemptionPolicy
> -
>
> Key: YARN-2022
> URL: https://issues.apache.org/jira/browse/YARN-2022
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Yarn-2022.1.patch
>
>
> Cluster Size = 16GB [2NM's]
> Queue A Capacity = 50%
> Queue B Capacity = 50%
> Consider there are 3 applications running in Queue A which has taken the full 
> cluster capacity. 
> J1 = 2GB AM + 1GB * 4 Maps
> J2 = 2GB AM + 1GB * 4 Maps
> J3 = 2GB AM + 1GB * 2 Maps
> Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
> Currently in this scenario, Jobs J3 will get killed including its AM.
> It is better if AM can be given least priority among multiple applications. 
> In this same scenario, map tasks from J3 and J2 can be preempted.
> Later when cluster is free, maps can be allocated to these Jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()

2014-05-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997334#comment-13997334
 ] 

Sandy Ryza commented on YARN-2042:
--

+1
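
For reference, a minimal illustration of the check the description below suggests (not 
the attached patch):

{code}
if (queueName != null && !queueName.isEmpty()) {
  // proceed with the nested user queue placement
}
{code}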

> String shouldn't be compared using == in 
> QueuePlacementRule#NestedUserQueue#getQueueForApp()
> 
>
> Key: YARN-2042
> URL: https://issues.apache.org/jira/browse/YARN-2042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Chen He
>Priority: Minor
> Attachments: YARN-2042.patch
>
>
> {code}
>   if (queueName != null && queueName != "") {
> {code}
> queueName.isEmpty() should be used instead of comparing against ""



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1986) In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE

2014-05-14 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-1986:
-

Assignee: Hong Zhiguo  (was: Sandy Ryza)

> In Fifo Scheduler, node heartbeat in between creating app and attempt causes 
> NPE
> 
>
> Key: YARN-1986
> URL: https://issues.apache.org/jira/browse/YARN-1986
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Jon Bringhurst
>Assignee: Hong Zhiguo
>Priority: Critical
> Attachments: YARN-1986-2.patch, YARN-1986-3.patch, 
> YARN-1986-testcase.patch, YARN-1986.patch
>
>
> After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
> -After RM was restarted, the job runs without a problem.-
> {noformat}
> 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type 
> NODE_UPDATE to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591)
>   at java.lang.Thread.run(Thread.java:744)
> 19:11:13,443  INFO ResourceManager:604 - Exiting, bbye..
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2055) Preemtion: Jobs are failing due to AMs are getting launched and killed multiple times

2014-05-14 Thread Mayank Bansal (JIRA)
Mayank Bansal created YARN-2055:
---

 Summary: Preemtion: Jobs are failing due to AMs are getting 
launched and killed multiple times
 Key: YARN-2055
 URL: https://issues.apache.org/jira/browse/YARN-2055
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Mayank Bansal
Assignee: Sunil G


Cluster Size = 16GB [2NM's]
Queue A Capacity = 50%
Queue B Capacity = 50%
Consider there are 3 applications running in Queue A which has taken the full 
cluster capacity. 
J1 = 2GB AM + 1GB * 4 Maps
J2 = 2GB AM + 1GB * 4 Maps
J3 = 2GB AM + 1GB * 2 Maps

Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
Currently in this scenario, Jobs J3 will get killed including its AM.

It is better if AM can be given least priority among multiple applications. In 
this same scenario, map tasks from J3 and J2 can be preempted.
Later when cluster is free, maps can be allocated to these Jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)