[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726122#comment-13726122
 ] 

Sandy Ryza commented on YARN-1004:
--

Uploaded a patch that adds "fair", "capacity", and "fifo" to the minimum and 
increment configs.  The patch turned out to require quite a few changes.  I 
wasn't sure how to deal with the slots-millis job counter - it seems like it 
doesn't make sense in the context of MR2 - so I removed it.  If this would 
delay the release, I'm not sure it's worth it.

[~tucu], you worked on the per-scheduler separation of these configs.  Any 
thoughts?
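
As an illustration only (this is not the attached patch), a scheduler-specific key 
could be read with a fallback to the existing generic key so that old configurations 
keep working; the key names mirror the ones proposed in the description below, and 
the 1024 MB default is an assumption:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class FairMinAllocationExample {
  // Proposed per-scheduler key (see the description below) and the current generic key.
  static final String FAIR_MIN_MB = "yarn.scheduler.fair.minimum-allocation-mb";
  static final String GENERIC_MIN_MB = "yarn.scheduler.minimum-allocation-mb";

  /** Prefer the scheduler-specific key, fall back to the old key, then to a default. */
  static int getMinimumAllocationMb(Configuration conf) {
    int fallback = conf.getInt(GENERIC_MIN_MB, 1024);
    return conf.getInt(FAIR_MIN_MB, fallback);
  }
}
{code}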

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
> Attachments: YARN-1004.patch
>
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726124#comment-13726124
 ] 

Sandy Ryza commented on YARN-1004:
--

By which I mean [~tucu00]

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
> Attachments: YARN-1004.patch
>
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (YARN-758) Augment MockNM to use multiple cores

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened YARN-758:
--


Our test-patch is not smart enough: there isn't any test in the patch, but it 
didn't complain.

I think we should have a candidate test that verifies that the code change 
works (and, more importantly, is useful). TestRMRestart used to be that; we 
could add a very simple test that hangs before the change and passes with it.
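
Sketching what such a candidate test might look like (illustrative only: the 
three-argument registerNode(host, memoryMB, vCores) is an assumed overload of the 
patched MockRM/MockNM, not necessarily the committed API, and this is not part of 
the attached patches):

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.MockNM;
import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
import org.junit.Test;

public class TestMockNMMultipleCores {
  // Before YARN-758 the MockNM advertised a single vCore, so anything needing
  // more cores per node could never be scheduled and a test like this would
  // hang; with the change it should pass within the timeout.
  @Test(timeout = 30000)
  public void testSchedulingWithMultipleCores() throws Exception {
    MockRM rm = new MockRM(new YarnConfiguration());
    rm.start();
    // Register a node with 8 GB and 4 vCores (assumed registerNode overload).
    MockNM nm = rm.registerNode("host1:1234", 8 * 1024, 4);
    rm.submitApp(1024);
    nm.nodeHeartbeat(true);
    // ... wait for the AM container to be allocated; a hang/timeout would
    // indicate the node's vCores were not honored ...
    rm.stop();
  }
}
{code}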

> Augment MockNM to use multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Fix For: 2.3.0
>
> Attachments: yarn-758-1.patch, yarn-758-2.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1003) Add a maxContainersPerNode config to the Fair Scheduler

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726118#comment-13726118
 ] 

Vinod Kumar Vavilapalli commented on YARN-1003:
---

bq. This makes it so that we don't need to use the resources we currently have 
as proxies for the ones we don't. If we add resources like disk and network 
I/O, this will become much less necessary, but I still think a high number of 
containers will put load on a system in ways that we won't account for. Open 
file descriptors for example. If my machine has 8 GB, I might want to allow a 
512 MB container to fit in between a 4 GB one and a 3.5 GB one, but might not 
want to allow 16 512 MB containers.

That doesn't sound right. While I agree that we aren't supporting all 
resources, this can be done by being conservative on the memory and/or cpu 
cores per node. Adding a new config won't make a difference; how much would you 
set it to?
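
For reference, the per-node ceilings referred to here are the NodeManager resource 
settings; a minimal sketch (with arbitrary example values) of capping what a node 
advertises, which indirectly bounds how many containers can land on it:

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeResourceLimitsExample {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Advertise at most 8 GB and 8 vCores per NodeManager (example values).
    conf.setInt("yarn.nodemanager.resource.memory-mb", 8 * 1024);
    conf.setInt("yarn.nodemanager.resource.cpu-vcores", 8);
    // With a 512 MB / 1 vCore minimum allocation, at most 8 containers fit by
    // vCores, even though 16 would fit by memory alone.
  }
}
{code}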

> Add a maxContainersPerNode config to the Fair Scheduler
> ---
>
> Key: YARN-1003
> URL: https://issues.apache.org/jira/browse/YARN-1003
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Karthik Kambatla
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-1004:
-

Attachment: YARN-1004.patch

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
> Attachments: YARN-1004.patch
>
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-758) Augment MockNM to use multiple cores

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726114#comment-13726114
 ] 

Hudson commented on YARN-758:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #4197 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4197/])
YARN-758. Augment MockNM to use multiple cores (Karthik Kambatla via Sandy 
Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1509086)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java


> Augment MockNM to use multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Fix For: 2.3.0
>
> Attachments: yarn-758-1.patch, yarn-758-2.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1003) Add a maxContainersPerNode config to the Fair Scheduler

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726111#comment-13726111
 ] 

Sandy Ryza commented on YARN-1003:
--

My thinking was not per job.  Agreed that the way to do that should be through 
the AM.

> Add a maxContainersPerNode config to the Fair Scheduler
> ---
>
> Key: YARN-1003
> URL: https://issues.apache.org/jira/browse/YARN-1003
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Karthik Kambatla
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1003) Add a maxContainersPerNode config to the Fair Scheduler

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726101#comment-13726101
 ] 

Vinod Kumar Vavilapalli commented on YARN-1003:
---

I guess you are referring to restricting the number of tasks per node *per 
job*. If so, the implementation was always a hack in MR1. It was first wrongly 
put outside the scheduler, and then pushed into the FairScheduler because the 
devs behind FS wanted the feature at least in FS, as there was no correct way 
to do it in MR1/JT.

The correct way to do this now with YARN is to implement it inside the AM. 
That's really what users always wanted: to restrict a given job to run, say, 
only one task per node.
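
Purely as an illustration of the AM-side approach (not an existing feature), an 
ApplicationMaster using AMRMClientAsync could track allocations per host and give 
back any container beyond its own per-node cap; the cap value and the bookkeeping 
map below are illustrative:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

public class PerNodeCapSketch {
  private static final int MAX_CONTAINERS_PER_NODE = 1;  // illustrative cap
  private final Map<String, Integer> perHost = new HashMap<String, Integer>();

  /** Called from the AM's onContainersAllocated callback (sketch). */
  void accept(AMRMClientAsync<?> amRmClient, List<Container> allocated) {
    for (Container c : allocated) {
      String host = c.getNodeId().getHost();
      int used = perHost.containsKey(host) ? perHost.get(host) : 0;
      if (used >= MAX_CONTAINERS_PER_NODE) {
        // Over the job's own per-node limit: give the container back to the RM.
        amRmClient.releaseAssignedContainer(c.getId());
      } else {
        perHost.put(host, used + 1);
        // ... launch the container as usual ...
      }
    }
  }
}
{code}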

> Add a maxContainersPerNode config to the Fair Scheduler
> ---
>
> Key: YARN-1003
> URL: https://issues.apache.org/jira/browse/YARN-1003
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Karthik Kambatla
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1003) Add a maxContainersPerNode config to the Fair Scheduler

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726100#comment-13726100
 ] 

Sandy Ryza commented on YARN-1003:
--

This makes it so that we don't need to use the resources we currently have as 
proxies for the ones we don't.  If we add resources like disk and network I/O, 
this will become much less necessary, but I still think a high number of 
containers will put load on a system in ways that we won't account for.  Open 
file descriptors for example.  If my machine has 8 GB, I might want to allow a 
512 MB container to fit in between a 4 GB one and a 3.5 GB one, but might not 
want to allow 16 512 MB containers.

> Add a maxContainersPerNode config to the Fair Scheduler
> ---
>
> Key: YARN-1003
> URL: https://issues.apache.org/jira/browse/YARN-1003
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Karthik Kambatla
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1003) Add a maxContainersPerNode config to the Fair Scheduler

2013-07-31 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726095#comment-13726095
 ] 

Karthik Kambatla commented on YARN-1003:


One of the most common questions we hear from people migrating to MR2 is how 
they can restrict the number of tasks per node. While they can adjust the task 
memory/cpu requirements for this, it is more involved than the MR1 model, 
where one could simply set the maximum maps/reduces per node.
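
For context, the MR2-style workaround mentioned here is sizing task resources so 
only the desired number fit per node; a hedged sketch using the standard MapReduce 
memory keys (example values, assuming 8 GB NodeManagers):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class IndirectPerNodeCapExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // With yarn.nodemanager.resource.memory-mb = 8192 on each node,
    // 2048 MB maps mean at most 8192 / 2048 = 4 map containers per node.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
  }
}
{code}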

> Add a maxContainersPerNode config to the Fair Scheduler
> ---
>
> Key: YARN-1003
> URL: https://issues.apache.org/jira/browse/YARN-1003
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Karthik Kambatla
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1003) Add a maxContainersPerNode config to the Fair Scheduler

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726089#comment-13726089
 ] 

Vinod Kumar Vavilapalli commented on YARN-1003:
---

Why is this needed? We already have upper limits on resources per node.

> Add a maxContainersPerNode config to the Fair Scheduler
> ---
>
> Key: YARN-1003
> URL: https://issues.apache.org/jira/browse/YARN-1003
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Karthik Kambatla
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-994) HeartBeat thread in AMRMClientAsync does not handle runtime exception correctly

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726087#comment-13726087
 ] 

Hadoop QA commented on YARN-994:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595350/YARN-994.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1635//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1635//console

This message is automatically generated.

> HeartBeat thread in AMRMClientAsync does not handle runtime exception 
> correctly
> ---
>
> Key: YARN-994
> URL: https://issues.apache.org/jira/browse/YARN-994
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-994.1.patch, YARN-994.2.patch
>
>
> YARN-654 performs sanity checks for parameters of public methods in 
> AMRMClient. Those may create runtime exception. 
> Currently, heartBeat thread in AMRMClientAsync only captures IOException and 
> YarnException, and will not handle Runtime Exception properly. 
> Possible solution can be: heartbeat thread will catch throwable and notify 
> the callbackhandler thread via existing savedException
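
A minimal sketch of the handling proposed in the description above, with 
illustrative names ('savedException', 'handler') standing in for the 
AMRMClientAsync internals rather than quoting them:

{code:java}
public class HeartbeatThreadSketch implements Runnable {
  interface ErrorCallback { void onError(Throwable t); }

  private final Object lock = new Object();
  private volatile Throwable savedException;   // mirrors the existing field described above
  private final ErrorCallback handler;

  HeartbeatThreadSketch(ErrorCallback handler) { this.handler = handler; }

  public void run() {
    while (savedException == null && !Thread.currentThread().isInterrupted()) {
      try {
        heartbeatOnce();
      } catch (Throwable t) {      // catch Throwable, not just IOException/YarnException
        synchronized (lock) {
          savedException = t;      // let the callback-handler thread pick it up
          lock.notifyAll();
        }
        handler.onError(t);        // surface the failure instead of dying silently
        return;
      }
    }
  }

  private void heartbeatOnce() throws Exception {
    // placeholder for the real allocate() heartbeat, which also sleeps between rounds
  }
}
{code}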

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-994) HeartBeat thread in AMRMClientAsync does not handle runtime exception correctly

2013-07-31 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-994:
---

Attachment: YARN-994.2.patch

Added a test case to verify that the RuntimeException is captured and the 
callback handler is invoked correctly.

> HeartBeat thread in AMRMClientAsync does not handle runtime exception 
> correctly
> ---
>
> Key: YARN-994
> URL: https://issues.apache.org/jira/browse/YARN-994
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-994.1.patch, YARN-994.2.patch
>
>
> YARN-654 performs sanity checks for parameters of public methods in 
> AMRMClient. Those may create runtime exception. 
> Currently, heartBeat thread in AMRMClientAsync only captures IOException and 
> YarnException, and will not handle Runtime Exception properly. 
> Possible solution can be: heartbeat thread will catch throwable and notify 
> the callbackhandler thread via existing savedException

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-758) Augment MockNM to use multiple cores

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726069#comment-13726069
 ] 

Sandy Ryza commented on YARN-758:
-

+1

> Augment MockNM to use multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch, yarn-758-2.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-107) ClientRMService.forceKillApplication() should handle the non-RUNNING applications properly

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726063#comment-13726063
 ] 

Hadoop QA commented on YARN-107:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595345/YARN-107.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1634//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1634//console

This message is automatically generated.

> ClientRMService.forceKillApplication() should handle the non-RUNNING 
> applications properly
> --
>
> Key: YARN-107
> URL: https://issues.apache.org/jira/browse/YARN-107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Devaraj K
>Assignee: Xuan Gong
> Attachments: YARN-107.1.patch, YARN-107.2.patch, YARN-107.3.patch, 
> YARN-107.4.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-107) ClientRMService.forceKillApplication() should handle the non-RUNNING applications properly

2013-07-31 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-107:
---

Attachment: YARN-107.4.patch

Check the application state in the CLI before we send the kill command, and 
print a different message if the application is already in a terminated state. 
Also let forceKillApplication return quietly if it tries to kill a non-running 
application.
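
A rough sketch of the CLI-side check described above, using the public YarnClient 
API (the exact message wording and the server-side handling in the attached patch 
may differ):

{code:java}
import java.io.IOException;
import java.util.EnumSet;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class KillAppSketch {
  private static final EnumSet<YarnApplicationState> TERMINAL = EnumSet.of(
      YarnApplicationState.FINISHED,
      YarnApplicationState.FAILED,
      YarnApplicationState.KILLED);

  static void killApplication(YarnClient client, ApplicationId appId)
      throws YarnException, IOException {
    YarnApplicationState state =
        client.getApplicationReport(appId).getYarnApplicationState();
    if (TERMINAL.contains(state)) {
      // Already finished: report it instead of issuing a redundant kill.
      System.out.println("Application " + appId + " has already finished (" + state + ")");
    } else {
      client.killApplication(appId);
      System.out.println("Killing application " + appId);
    }
  }
}
{code}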


> ClientRMService.forceKillApplication() should handle the non-RUNNING 
> applications properly
> --
>
> Key: YARN-107
> URL: https://issues.apache.org/jira/browse/YARN-107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Devaraj K
>Assignee: Xuan Gong
> Attachments: YARN-107.1.patch, YARN-107.2.patch, YARN-107.3.patch, 
> YARN-107.4.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-808) ApplicationReport does not clearly tell that the attempt is running or not

2013-07-31 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726042#comment-13726042
 ] 

Zhijie Shen commented on YARN-808:
--

+1 for embedding ApplicationAttemptReport in ApplicationReport.

Thinking out loud: I have a concern that with more info to be fetched, 
getApplicationReport is likely to be slower, and the response message is likely 
to be bigger. However, users may not always want to know all the info of an 
application, such as the embedded ApplicationAttemptReport, right? Sometimes 
users just want to fetch partial information of an application to speed up the 
response. Thoughts?

> ApplicationReport does not clearly tell that the attempt is running or not
> --
>
> Key: YARN-808
> URL: https://issues.apache.org/jira/browse/YARN-808
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-808.1.patch
>
>
> When an app attempt fails and is being retried, ApplicationReport immediately 
> gives the new attemptId and non-null values for the host etc. There is no way for 
> clients to know whether the attempt is running other than connecting to it and 
> timing out on an invalid host. A solution would be to expose the attempt state or 
> return a null value for the host instead of "N/A".
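
For illustration, the fragile heuristic a client is left with today (per the 
description above); the "N/A" sentinel check reflects the current behaviour, and 
exposing the attempt state would remove the need for it:

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationReport;

public class AttemptRunningCheck {
  /**
   * The report exposes a host even while the new attempt is still starting, so
   * "N/A" (or a connect timeout) is the only hint that it is not reachable yet.
   */
  static boolean attemptLooksReachable(ApplicationReport report) {
    String host = report.getHost();
    return host != null && !"N/A".equals(host);
  }
}
{code}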

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-994) HeartBeat thread in AMRMClientAsync does not handle runtime exception correctly

2013-07-31 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726039#comment-13726039
 ] 

Xuan Gong commented on YARN-994:


Yes, I will do that.

> HeartBeat thread in AMRMClientAsync does not handle runtime exception 
> correctly
> ---
>
> Key: YARN-994
> URL: https://issues.apache.org/jira/browse/YARN-994
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-994.1.patch
>
>
> YARN-654 performs sanity checks for parameters of public methods in 
> AMRMClient. Those may create runtime exception. 
> Currently, heartBeat thread in AMRMClientAsync only captures IOException and 
> YarnException, and will not handle Runtime Exception properly. 
> Possible solution can be: heartbeat thread will catch throwable and notify 
> the callbackhandler thread via existing savedException

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-855) YarnClient.init should ensure that yarn parameters are present

2013-07-31 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725969#comment-13725969
 ] 

Siddharth Seth commented on YARN-855:
-

The simplest would be to check the configuration type - which keeps the API 
stable.

The reason I mentioned parameters is that apps that use YarnClient may have 
their own configuration type - e.g. JobConf or a HiveConf. Type information 
ends up getting lost even if these apps have created their configurations based 
on a YarnConfiguration.
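
A minimal sketch of the type-check approach described above (wrapping rather than 
rejecting the passed-in configuration is an assumption; the real YarnClient.init 
behaviour may differ):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientInitSketch {
  /** Ensure yarn-default/yarn-site resources are loaded, whatever conf type the app passed in. */
  static Configuration ensureYarnConfiguration(Configuration conf) {
    // JobConf, HiveConf, etc. lose the YarnConfiguration type information even
    // when built from one, so checking the type and wrapping keeps the API stable.
    return (conf instanceof YarnConfiguration) ? conf : new YarnConfiguration(conf);
  }
}
{code}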

> YarnClient.init should ensure that yarn parameters are present
> --
>
> Key: YARN-855
> URL: https://issues.apache.org/jira/browse/YARN-855
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.0.5-alpha
>Reporter: Siddharth Seth
>Assignee: Abhishek Kapoor
>
> It currently accepts a Configuration object in init and doesn't check whether 
> it contains yarn parameters or is a YarnConfiguration. It should either accept a 
> YarnConfiguration, check for the existence of the parameters, or create a 
> YarnConfiguration based on the configuration passed to it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-758) Augment MockNM to use multiple cores

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725950#comment-13725950
 ] 

Hadoop QA commented on YARN-758:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595317/yarn-758-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1633//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1633//console

This message is automatically generated.

> Augment MockNM to use multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch, yarn-758-2.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-770) NPE NodeStatusUpdaterImpl

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-770.
--

Resolution: Invalid

[~ste...@apache.org], I am closing this as invalid for now. The code has 
changed a lot, and it isn't apparent what was causing the NPE.

Please feel free to reopen it when you run into it again. Thanks!

> NPE NodeStatusUpdaterImpl
> -
>
> Key: YARN-770
> URL: https://issues.apache.org/jira/browse/YARN-770
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> A mini YARN cluster based test just failed with an NPE in the logs in 
> {{NodeStatusUpdaterImpl}}, which is probably a symptom of the problem, not 
> the cause (network trouble is more likely there), but it shows that some extra 
> checking for null responses is needed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-758) Augment MockNM to use multiple cores

2013-07-31 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-758:
--

Summary: Augment MockNM to use multiple cores  (was: TestRMRestart should 
use MockNMs with multiple cores)

> Augment MockNM to use multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch, yarn-758-2.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-758) TestRMRestart should use MockNMs with multiple cores

2013-07-31 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-758:
--

Attachment: yarn-758-2.patch

Updated the patch to address Sandy's comment. With this fix there was no need to 
touch TestRMRestart; I verified that it passes.

> TestRMRestart should use MockNMs with multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch, yarn-758-2.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-903) DistributedShell throwing Errors in logs after successful completion

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725936#comment-13725936
 ] 

Hadoop QA commented on YARN-903:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12595315/YARN-903-20130731.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1632//console

This message is automatically generated.

> DistributedShell throwing Errors in logs after successful completion
> -
>
> Key: YARN-903
> URL: https://issues.apache.org/jira/browse/YARN-903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
>Affects Versions: 2.0.4-alpha
> Environment: Ubuntu 11.10
>Reporter: Abhishek Kapoor
>Assignee: Omkar Vinit Joshi
> Attachments: AppMaster.stderr, YARN-903-20130717.1.patch, 
> YARN-903-20130718.1.patch, YARN-903-20130723.patch, 
> YARN-903-20130729.1.patch, YARN-903-20130730.1.patch, 
> YARN-903-20130731.1.patch, YARN-903-20130731.2.patch, 
> yarn-sunny-nodemanager-sunny-Inspiron.log
>
>
> I have tried running DistributedShell and also used its ApplicationMaster for 
> my test.
> The application runs successfully, though it logs some errors which 
> would be useful to fix.
> Below are the logs from the NodeManager and the ApplicationMaster.
> Log Snippet for NodeManager
> =
> 2013-07-07 13:39:18,787 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connecting 
> to ResourceManager at localhost/127.0.0.1:9990. current no. of attempts is 1
> 2013-07-07 13:39:19,050 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
>  Rolling master-key for container-tokens, got key with id -325382586
> 2013-07-07 13:39:19,052 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: 
> Rolling master-key for nm-tokens, got key with id :1005046570
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
> with ResourceManager as sunny-Inspiron:9993 with total resource of 
> 
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying 
> ContainerManager to unblock new container-requests
> 2013-07-07 13:39:35,256 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1373184544832_0001_01 (auth:SIMPLE)
> 2013-07-07 13:39:35,492 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1373184544832_0001_01_01 by user sunny
> 2013-07-07 13:39:35,507 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Creating a new application reference for app application_1373184544832_0001
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny  
> IP=127.0.0.1OPERATION=Start Container Request   
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1373184544832_0001
> CONTAINERID=container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from NEW to INITING
> 2013-07-07 13:39:35,512 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Adding container_1373184544832_0001_01_01 to application 
> application_1373184544832_0001
> 2013-07-07 13:39:35,518 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from INITING to 
> RUNNING
> 2013-07-07 13:39:35,528 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1373184544832_0001_01_01 transitioned from NEW to 
> LOCALIZING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource hdfs://localhost:9000/application/test.jar transitioned from INIT 
> to DOWNLOADING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,675 INFO 

[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725932#comment-13725932
 ] 

Hadoop QA commented on YARN-573:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12595310/YARN-573-20130731.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1631//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1631//console

This message is automatically generated.

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-502) RM crash with NPE on NODE_REMOVED event with FairScheduler

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725929#comment-13725929
 ] 

Hudson commented on YARN-502:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #4193 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4193/])
YARN-502. Fixed a state machine issue with RMNode inside ResourceManager which 
was crashing scheduler. Contributed by Mayank Bansal. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1509060)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java


> RM crash with NPE on NODE_REMOVED event with FairScheduler
> --
>
> Key: YARN-502
> URL: https://issues.apache.org/jira/browse/YARN-502
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.0.3-alpha
>Reporter: Lohit Vijayarenu
>Assignee: Mayank Bansal
> Fix For: 2.1.1-beta
>
> Attachments: YARN-502-trunk-1.patch, YARN-502-trunk-2.patch, 
> YARN-502-trunk-3.patch
>
>
> While running some test and adding/removing nodes, we see RM crashed with the 
> below exception. We are testing with fair scheduler and running 
> hadoop-2.0.3-alpha
> {noformat}
> 2013-03-22 18:54:27,015 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node :55680 as it is now LOST
> 2013-03-22 18:54:27,015 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: :55680 
> Node Transitioned from UNHEALTHY to LOST
> 2013-03-22 18:54:27,015 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_REMOVED to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:619)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:856)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
> at java.lang.Thread.run(Thread.java:662)
> 2013-03-22 18:54:27,016 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> 2013-03-22 18:54:27,020 INFO org.mortbay.log: Stopped 
> SelectChannelConnector@:50030
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-854) App submission fails on secure deploy

2013-07-31 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725923#comment-13725923
 ] 

Konstantin Boudnik commented on YARN-854:
-

I have started the release process...



> App submission fails on secure deploy
> -
>
> Key: YARN-854
> URL: https://issues.apache.org/jira/browse/YARN-854
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Ramya Sunil
>Assignee: Omkar Vinit Joshi
>Priority: Blocker
> Fix For: 2.1.0-beta
>
> Attachments: YARN-854.20130619.1.patch, YARN-854.20130619.2.patch, 
> YARN-854.20130619.patch, YARN-854-branch-2.0.6.patch
>
>
> App submission on secure cluster fails with the following exception:
> {noformat}
> INFO mapreduce.Job: Job jobID failed with state FAILED due to: Application 
> applicationID failed 2 times due to AM Container for appattemptID exited with 
>  exitCode: -1000 due to: App initialization failed (255) with output: main : 
> command provided 0
> main : user is qa_user
> javax.security.sasl.SaslException: DIGEST-MD5: digest response format 
> violation. Mismatched response. [Caused by 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): 
> DIGEST-MD5: digest response format violation. Mismatched response.]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:65)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:348)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): 
> DIGEST-MD5: digest response format violation. Mismatched response.
>   at org.apache.hadoop.ipc.Client.call(Client.java:1298)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1250)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:204)
>   at $Proxy7.heartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
>   ... 3 more
> .Failing this attempt.. Failing the application.
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-903) DistributedShell throwing Errors in logs after successful completion

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-903:
---

Attachment: YARN-903-20130731.2.patch

> DistributedShell throwing Errors in logs after successful completion
> -
>
> Key: YARN-903
> URL: https://issues.apache.org/jira/browse/YARN-903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
>Affects Versions: 2.0.4-alpha
> Environment: Ubuntu 11.10
>Reporter: Abhishek Kapoor
>Assignee: Omkar Vinit Joshi
> Attachments: AppMaster.stderr, YARN-903-20130717.1.patch, 
> YARN-903-20130718.1.patch, YARN-903-20130723.patch, 
> YARN-903-20130729.1.patch, YARN-903-20130730.1.patch, 
> YARN-903-20130731.1.patch, YARN-903-20130731.2.patch, 
> yarn-sunny-nodemanager-sunny-Inspiron.log
>
>
> I have tried running DistributedShell and also used its ApplicationMaster for 
> my test.
> The application runs successfully, though it logs some errors which 
> would be useful to fix.
> Below are the logs from the NodeManager and the ApplicationMaster.
> Log Snippet for NodeManager
> =
> 2013-07-07 13:39:18,787 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connecting 
> to ResourceManager at localhost/127.0.0.1:9990. current no. of attempts is 1
> 2013-07-07 13:39:19,050 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
>  Rolling master-key for container-tokens, got key with id -325382586
> 2013-07-07 13:39:19,052 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: 
> Rolling master-key for nm-tokens, got key with id :1005046570
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
> with ResourceManager as sunny-Inspiron:9993 with total resource of 
> 
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying 
> ContainerManager to unblock new container-requests
> 2013-07-07 13:39:35,256 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1373184544832_0001_01 (auth:SIMPLE)
> 2013-07-07 13:39:35,492 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1373184544832_0001_01_01 by user sunny
> 2013-07-07 13:39:35,507 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Creating a new application reference for app application_1373184544832_0001
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny  
> IP=127.0.0.1OPERATION=Start Container Request   
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1373184544832_0001
> CONTAINERID=container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from NEW to INITING
> 2013-07-07 13:39:35,512 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Adding container_1373184544832_0001_01_01 to application 
> application_1373184544832_0001
> 2013-07-07 13:39:35,518 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from INITING to 
> RUNNING
> 2013-07-07 13:39:35,528 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1373184544832_0001_01_01 transitioned from NEW to 
> LOCALIZING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource hdfs://localhost:9000/application/test.jar transitioned from INIT 
> to DOWNLOADING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,675 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/nmPrivate/container_1373184544832_0001_01_01.tokens.
>  Credentials list: 
> 2013-07-07 13:39:35,694 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: 
> Initializing user sunny
> 2013-07-07 13:39:35,803 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying 
> from 
> /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/nmPrivate/container_1373184544832_0001

[jira] [Commented] (YARN-502) RM crash with NPE on NODE_REMOVED event with FairScheduler

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725913#comment-13725913
 ] 

Vinod Kumar Vavilapalli commented on YARN-502:
--

Though the explicit state check is a little unmaintainable if we add new 
states in the future, the current change is less intrusive. The better way would 
have been to create a new transition class, but I'm okay with this.

+1, checking this in.
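
For illustration of where the NPE arose (the committed fix, per the commit message 
earlier in this digest, instead adjusted the RMNodeImpl state machine), a defensive 
guard against removing an unknown node might look like the following standalone 
sketch; the map is a stand-in for the FairScheduler's per-node bookkeeping, not its 
real fields:

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.NodeId;

public class NodeRemovalGuard {
  // Stand-in for the scheduler's node map (FSSchedulerNode values in the real code).
  private final Map<NodeId, Object> nodes = new HashMap<NodeId, Object>();

  /** Ignore NODE_REMOVED for a node the scheduler no longer (or never) tracked. */
  void removeNode(NodeId nodeId) {
    Object node = nodes.remove(nodeId);
    if (node == null) {
      return;  // previously the null lookup was dereferenced and the RM crashed
    }
    // ... release the node's running containers, update cluster capacity, etc. ...
  }
}
{code}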

> RM crash with NPE on NODE_REMOVED event with FairScheduler
> --
>
> Key: YARN-502
> URL: https://issues.apache.org/jira/browse/YARN-502
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.0.3-alpha
>Reporter: Lohit Vijayarenu
>Assignee: Mayank Bansal
> Attachments: YARN-502-trunk-1.patch, YARN-502-trunk-2.patch, 
> YARN-502-trunk-3.patch
>
>
> While running some test and adding/removing nodes, we see RM crashed with the 
> below exception. We are testing with fair scheduler and running 
> hadoop-2.0.3-alpha
> {noformat}
> 2013-03-22 18:54:27,015 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node :55680 as it is now LOST
> 2013-03-22 18:54:27,015 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: :55680 
> Node Transitioned from UNHEALTHY to LOST
> 2013-03-22 18:54:27,015 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_REMOVED to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:619)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:856)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
> at java.lang.Thread.run(Thread.java:662)
> 2013-03-22 18:54:27,016 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> 2013-03-22 18:54:27,020 INFO org.mortbay.log: Stopped 
> SelectChannelConnector@:50030
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-573:
---

Attachment: YARN-573-20130731.1.patch

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725909#comment-13725909
 ] 

Omkar Vinit Joshi commented on YARN-573:


Thanks [~jlowe] and [~sjlee0] for reviewing.

Fixed the comments.

[~sjlee0] yes, ConcurrentLinkedQueue would solve this synchronization issue 
altogether. I am planning to restructure this code when we fix YARN-574. Today 
the update method makes two calls to findNextResource, which ideally should be 
one; after that the whole code gets simplified a lot. Also, inside 
findNextResource we repeatedly check the same set of resources (the list) again 
and again until the resource gets downloaded, which ideally should only be done 
once. That is out of scope for this jira, though, and will definitely be 
addressed in another jira (YARN-574).
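
For reference, a minimal sketch of the kind of ConcurrentLinkedQueue-based 
restructuring being discussed (the class and element type are illustrative 
stand-ins, not the actual localizer fields):

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative stand-in for the localizer's shared pending list: a lock-free
// queue that the event-handling thread and the localizer thread can both use
// without external synchronization.
public class PendingQueueSketch {
  private final Queue<String> pending = new ConcurrentLinkedQueue<String>();

  // Called from the dispatcher thread when a new resource request arrives.
  public void addResource(String resource) {
    pending.offer(resource);
  }

  // Called from the localizer thread to pick the next resource to fetch;
  // returns null when nothing is pending.
  public String findNextResource() {
    return pending.poll();
  }
}
{code}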

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-975) Adding HDFS implementation for grouped reading and writing interfaces of history storage

2013-07-31 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-975:
-

Attachment: YARN-975.2.patch

Updated the patch:

1. Fixed some logic bugs in the previous patch.

2. Added the test case.

3. Fixed the javadoc warnings in ApplicationHistoryReader.

4. Changed the reader's and writer's methods to throw IOException (see the sketch below).
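
A hypothetical illustration of item 4 (the names below are stand-ins; the real 
interface in the patch may differ):

{code:java}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Hypothetical illustration only; names are stand-ins, not the actual
// interface in the patch.
public interface ApplicationHistoryReaderSketch {
  // Placeholder for whatever record type the real reader returns.
  class HistoryRecord {}

  // Declaring IOException lets an HDFS-backed implementation surface
  // storage failures directly to the caller.
  HistoryRecord getApplication(ApplicationId appId) throws IOException;
}
{code}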

> Adding HDFS implementation for grouped reading and writing interfaces of 
> history storage
> 
>
> Key: YARN-975
> URL: https://issues.apache.org/jira/browse/YARN-975
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-975.1.patch, YARN-975.2.patch
>
>
> HDFS implementation should be a standard persistence strategy of history 
> storage

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725898#comment-13725898
 ] 

Allen Wittenauer commented on YARN-972:
---

Thought running through my head: "I hope I have a way to turn this off because 
it does more harm than good.  I guess the alternative is just rip it out of the 
code base.  Thank goodness I build my own releases and I'm not reliant on a 
vendor."

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that a 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
> Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-758) TestRMRestart should use MockNMs with multiple cores

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725842#comment-13725842
 ] 

Sandy Ryza commented on YARN-758:
-

There are likely a number of other tests that have similar problems.  Would you 
be opposed to having the existing constructor scale the number of vcores by the 
amount of memory so that we can avoid fixing all of them individually?
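
For illustration, such a scaling rule could be as simple as this (the 
one-vcore-per-GB ratio is only an assumption for the sketch, not anything in 
the patch):

{code:java}
// Illustrative only: the kind of scaling rule being proposed, kept outside
// MockNM so the sketch stands alone. Roughly one vcore per GB, minimum 1.
public class MockNMVcoreScaling {
  static int vcoresFor(int memoryMB) {
    return Math.max(1, memoryMB / 1024);
  }

  public static void main(String[] args) {
    // e.g. a 15 GB MockNM would be registered with 14 vcores.
    System.out.println(vcoresFor(15120));
  }
}
{code}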

> TestRMRestart should use MockNMs with multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725837#comment-13725837
 ] 

Sandy Ryza commented on YARN-972:
-

I probably should have tried to be clearer about what I think the goals of 
virtual cores are, from a more zoomed-out perspective, before arguing about 
their specifics.

The problems we have been considering solving with virtual cores are:
1. "Many of the jobs on my cluster are computational simulations that use many 
threads per task.  Many of the other jobs on my cluster are distcp's that are 
primarily I/O bound.  Many of the jobs on my cluster are MapReduce that do 
something like apply a transformation to text, which are single-threaded, but 
can saturate a core.  How can we schedule these to maximize utilization and 
minimize harmful interference?" 

2. "I recently added machines with more or beefier CPUs to my cluster.  I would 
like to run more concurrent tasks on these machines than on other machines."

3. "I recently added machines with more or beefier CPUs to my cluster.  I would 
like my jobs to run at predictable speeds."

4. "CPUs vary widely in the world, but I would like to be able to take my job 
to another cluster and have it run at a similar speed."

I think (1) is the main problem we should be trying to solve.  (2) is also 
important, and much easier to think about when the new machines have a higher 
number of cores, but not substantially more powerful cores.  Luckily, the trend 
is towards more cores per machine, not more powerful cores.  I think we should 
not be trying to solve (3) and (4). There are too many variables, the 
real-world utility is too small, and the goals are unrealistic. The features 
proposed in YARN-796 are better approaches to handling this.

To these ends, here is how I think resource configurations should be used:

A task should request virtual cores equal to the number of cores it thinks it 
can saturate.  A task that runs in a single thread, no matter how CPU-intensive 
it is, should request a single virtual core.  A task that is inherently 
I/O-bound, like a distcp or simple grep, should request less than a single 
virtual core.  A task that can take advantage of multiple threads should 
request a number of cores equal to the number of threads it intends to take 
advantage of.

NodeManagers should be configured with virtual cores equal to the number of 
physical cores on the node.  If the speed of a single core varies widely within 
a cluster (maybe by a factor of two or more), an administrator can consider 
configuring more virtual cores than physical cores on the faster nodes, with the 
acknowledgement that task performance will still not be predictable.

Virtual cores should not be used as a proxy for other resources, such as disk 
I/O or network I/O.  We should ultimately add disk I/O and possibly network 
I/O as first-class resources, but in the meantime a config to limit the 
number of containers per node doesn't seem unreasonable. 

As Arun points out, we can realize this vision equivalently by saying that one 
physical core is always equal to 1000 virtual cores.  However, to me this seems 
like an unnecessary layer of indirection for the user, and it obscures the fact 
that virtual cores are meant to model parallelism rather than processing power.  
If our only reason for considering this is performance, we can and should 
handle it internally.  I am not obstinately opposed to going this route, but if 
we do, I think a name like "core thousandths" would be clearer.

Thoughts?
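
For illustration, here is how those guidelines might look when expressed 
through the records API, assuming the Resource.newInstance(memory, vCores) 
factory as in current trunk (the numbers are made up):

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

public class VcoreRequestExamples {
  public static void main(String[] args) {
    // A single-threaded but CPU-intensive task: one vcore.
    Resource singleThreaded = Resource.newInstance(1024, 1);

    // A task that can keep four threads busy: four vcores.
    Resource multiThreaded = Resource.newInstance(2048, 4);

    System.out.println(singleThreaded + " / " + multiThreaded);
    // An I/O-bound task (e.g. a distcp copier) would ideally ask for *less*
    // than one vcore, which is what this JIRA proposes to allow.
  }
}
{code}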

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that a 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
> Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the

[jira] [Commented] (YARN-956) [YARN-321] Add History Store interface and testable in-memory HistoryStorage

2013-07-31 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725818#comment-13725818
 ] 

Zhijie Shen commented on YARN-956:
--

bq. Can you also write a test for InMemory storage. This will be a good 
starting point to begin writing tests for AHS. Tx,

[~vinodkv], I recall one problem. WRT the test, if it has a reference to the RM, 
I think it is better placed either in the resourcemanager or the server-test 
sub-project, instead of the applicationhistoryservice sub-project. This is 
because the resourcemanager sub-project will eventually refer to 
applicationhistoryservice; if applicationhistoryservice already has a 
dependency on resourcemanager, there will be a cyclic dependency. Recall the 
issue in YARN-641.

> [YARN-321] Add History Store interface and testable in-memory HistoryStorage
> 
>
> Key: YARN-956
> URL: https://issues.apache.org/jira/browse/YARN-956
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-956-1.patch, YARN-956-2.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-808) ApplicationReport does not clearly tell that the attempt is running or not

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725809#comment-13725809
 ] 

Hadoop QA commented on YARN-808:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595279/YARN-808.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1629//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1629//console

This message is automatically generated.

> ApplicationReport does not clearly tell that the attempt is running or not
> --
>
> Key: YARN-808
> URL: https://issues.apache.org/jira/browse/YARN-808
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-808.1.patch
>
>
> When an app attempt fails and is being retried, ApplicationReport immediately 
> gives the new attemptId and non-null values of host etc. There is no way for 
> clients to know that the attempt is running other than connecting to it and 
> timing out on invalid host. Solution would be to expose the attempt state or 
> return a null value for host instead of "N/A"

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-903) DistributedShell throwing Errors in logs after successful completion

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725806#comment-13725806
 ] 

Hadoop QA commented on YARN-903:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12595291/YARN-903-20130731.1.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1630//console

This message is automatically generated.

> DistributedShell throwing Errors in logs after successful completion
> -
>
> Key: YARN-903
> URL: https://issues.apache.org/jira/browse/YARN-903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
>Affects Versions: 2.0.4-alpha
> Environment: Ubuntu 11.10
>Reporter: Abhishek Kapoor
>Assignee: Omkar Vinit Joshi
> Attachments: AppMaster.stderr, YARN-903-20130717.1.patch, 
> YARN-903-20130718.1.patch, YARN-903-20130723.patch, 
> YARN-903-20130729.1.patch, YARN-903-20130730.1.patch, 
> YARN-903-20130731.1.patch, yarn-sunny-nodemanager-sunny-Inspiron.log
>
>
> I have tried running DistributedShell and also used its ApplicationMaster for 
> my test.
> The application runs successfully, though it logs some errors which would be 
> useful to fix.
> Below are the logs from the NodeManager and the ApplicationMaster.
> Log Snippet for NodeManager
> =
> 2013-07-07 13:39:18,787 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connecting 
> to ResourceManager at localhost/127.0.0.1:9990. current no. of attempts is 1
> 2013-07-07 13:39:19,050 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
>  Rolling master-key for container-tokens, got key with id -325382586
> 2013-07-07 13:39:19,052 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: 
> Rolling master-key for nm-tokens, got key with id :1005046570
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
> with ResourceManager as sunny-Inspiron:9993 with total resource of 
> 
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying 
> ContainerManager to unblock new container-requests
> 2013-07-07 13:39:35,256 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1373184544832_0001_01 (auth:SIMPLE)
> 2013-07-07 13:39:35,492 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1373184544832_0001_01_01 by user sunny
> 2013-07-07 13:39:35,507 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Creating a new application reference for app application_1373184544832_0001
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny  
> IP=127.0.0.1OPERATION=Start Container Request   
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1373184544832_0001
> CONTAINERID=container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from NEW to INITING
> 2013-07-07 13:39:35,512 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Adding container_1373184544832_0001_01_01 to application 
> application_1373184544832_0001
> 2013-07-07 13:39:35,518 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from INITING to 
> RUNNING
> 2013-07-07 13:39:35,528 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1373184544832_0001_01_01 transitioned from NEW to 
> LOCALIZING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource hdfs://localhost:9000/application/test.jar transitioned from INIT 
> to DOWNLOADING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,675 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/nmPrivate/container_1373184544

[jira] [Commented] (YARN-107) ClientRMService.forceKillApplication() should handle the non-RUNNING applications properly

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725802#comment-13725802
 ] 

Vinod Kumar Vavilapalli commented on YARN-107:
--

Clearly, we have two options: throw an exception if an app is already finished, 
or return quietly. It is indeed useful for the CLI to say that an application 
has already finished, but the client itself can do that; ApplicationCLI can 
check the state and print a different message. If it is via the API, anybody 
can do a state check if need be. In that light, I am not strongly opinionated 
either way, but we can keep forceKillApplication as just one of those forced 
operations which usually return quietly (like rm -f).
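
For the API side, a client could do the state check itself along these lines (a 
sketch against the YarnClient API; error handling is omitted and the helper 
name is made up):

{code:java}
import java.util.EnumSet;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class KillHelper {
  private static final EnumSet<YarnApplicationState> TERMINAL =
      EnumSet.of(YarnApplicationState.FINISHED,
                 YarnApplicationState.FAILED,
                 YarnApplicationState.KILLED);

  // Check the state first so a CLI can print "already finished" while the RM
  // itself keeps returning quietly, like rm -f.
  static void killIfRunning(YarnClient client, ApplicationId appId)
      throws Exception {
    YarnApplicationState state =
        client.getApplicationReport(appId).getYarnApplicationState();
    if (TERMINAL.contains(state)) {
      System.out.println("Application " + appId + " already " + state);
      return;
    }
    client.killApplication(appId);
  }
}
{code}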

> ClientRMService.forceKillApplication() should handle the non-RUNNING 
> applications properly
> --
>
> Key: YARN-107
> URL: https://issues.apache.org/jira/browse/YARN-107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Devaraj K
>Assignee: Xuan Gong
> Attachments: YARN-107.1.patch, YARN-107.2.patch, YARN-107.3.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-903) DistributedShell throwing Errors in logs after successful completion

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-903:
---

Attachment: YARN-903-20130731.1.patch

> DistributedShell throwing Errors in logs after successful completion
> -
>
> Key: YARN-903
> URL: https://issues.apache.org/jira/browse/YARN-903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
>Affects Versions: 2.0.4-alpha
> Environment: Ubuntu 11.10
>Reporter: Abhishek Kapoor
>Assignee: Omkar Vinit Joshi
> Attachments: AppMaster.stderr, YARN-903-20130717.1.patch, 
> YARN-903-20130718.1.patch, YARN-903-20130723.patch, 
> YARN-903-20130729.1.patch, YARN-903-20130730.1.patch, 
> YARN-903-20130731.1.patch, yarn-sunny-nodemanager-sunny-Inspiron.log
>
>
> I have tried running DistributedShell and also used its ApplicationMaster for 
> my test.
> The application runs successfully, though it logs some errors which would be 
> useful to fix.
> Below are the logs from the NodeManager and the ApplicationMaster.
> Log Snippet for NodeManager
> =
> 2013-07-07 13:39:18,787 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connecting 
> to ResourceManager at localhost/127.0.0.1:9990. current no. of attempts is 1
> 2013-07-07 13:39:19,050 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
>  Rolling master-key for container-tokens, got key with id -325382586
> 2013-07-07 13:39:19,052 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: 
> Rolling master-key for nm-tokens, got key with id :1005046570
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
> with ResourceManager as sunny-Inspiron:9993 with total resource of 
> 
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying 
> ContainerManager to unblock new container-requests
> 2013-07-07 13:39:35,256 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1373184544832_0001_01 (auth:SIMPLE)
> 2013-07-07 13:39:35,492 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1373184544832_0001_01_01 by user sunny
> 2013-07-07 13:39:35,507 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Creating a new application reference for app application_1373184544832_0001
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny  
> IP=127.0.0.1OPERATION=Start Container Request   
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1373184544832_0001
> CONTAINERID=container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from NEW to INITING
> 2013-07-07 13:39:35,512 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Adding container_1373184544832_0001_01_01 to application 
> application_1373184544832_0001
> 2013-07-07 13:39:35,518 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from INITING to 
> RUNNING
> 2013-07-07 13:39:35,528 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1373184544832_0001_01_01 transitioned from NEW to 
> LOCALIZING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource hdfs://localhost:9000/application/test.jar transitioned from INIT 
> to DOWNLOADING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,675 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/nmPrivate/container_1373184544832_0001_01_01.tokens.
>  Credentials list: 
> 2013-07-07 13:39:35,694 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: 
> Initializing user sunny
> 2013-07-07 13:39:35,803 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying 
> from 
> /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/nmPrivate/container_1373184544832_0001_01_01.tokens
>  to 
> /ho

[jira] [Assigned] (YARN-953) [YARN-321] Change ResourceManager to use HistoryStorage to log history data

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reassigned YARN-953:


Assignee: Zhijie Shen  (was: Vinod Kumar Vavilapalli)

> [YARN-321] Change ResourceManager to use HistoryStorage to log history data
> ---
>
> Key: YARN-953
> URL: https://issues.apache.org/jira/browse/YARN-953
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: YARN-953.1.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-953) [YARN-321] Change ResourceManager to use HistoryStorage to log history data

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reassigned YARN-953:


Assignee: Vinod Kumar Vavilapalli  (was: Zhijie Shen)

> [YARN-321] Change ResourceManager to use HistoryStorage to log history data
> ---
>
> Key: YARN-953
> URL: https://issues.apache.org/jira/browse/YARN-953
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-953.1.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-903) DistributedShell throwing Errors in logs after successful completion

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725772#comment-13725772
 ] 

Omkar Vinit Joshi commented on YARN-903:


bq. NMContext is all about shared state. You are sharing state between 
ContainerManager and NodeStatusUpdater. We don't keep any shared state outside 
NMContext. You still having more casting in NodeManager. If you go this route, 
you will have to add clearTrackedFinishedContainersFromCache() to 
NodeStatusUpdater. Seems completely odd to have such a method on 
NodeStatusUpdater. I still prefer NMContext

Yes, I added the method clearTrackedFinishedContainersFromCache() to NodeStatusUpdater.

bq. Yes, the missing break is the main bug. We should add a very simple unit 
test that validates this addition and expiry. Only at NodeStatusUpdater unit 
level is fine enough.
Yes, I am adding one.
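
For reference, a purely hypothetical sketch of what such a cache plus a clear 
method might look like (field names, types, and structure here are 
illustrative, not what the patch actually does):

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Hypothetical sketch only: recently-completed container statuses are kept
// until the RM has acknowledged them, then cleared in one call.
class FinishedContainerCache {
  private final Map<ContainerId, ContainerStatus> recentlyCompleted =
      new ConcurrentHashMap<ContainerId, ContainerStatus>();

  void track(ContainerStatus status) {
    recentlyCompleted.put(status.getContainerId(), status);
  }

  // Called once the RM has acknowledged the completed containers.
  void clearTrackedFinishedContainersFromCache(Set<ContainerId> acked) {
    for (ContainerId id : acked) {
      recentlyCompleted.remove(id);
    }
  }
}
{code}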

> DistributedShell throwing Errors in logs after successful completion
> -
>
> Key: YARN-903
> URL: https://issues.apache.org/jira/browse/YARN-903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
>Affects Versions: 2.0.4-alpha
> Environment: Ubuntu 11.10
>Reporter: Abhishek Kapoor
>Assignee: Omkar Vinit Joshi
> Attachments: AppMaster.stderr, YARN-903-20130717.1.patch, 
> YARN-903-20130718.1.patch, YARN-903-20130723.patch, 
> YARN-903-20130729.1.patch, YARN-903-20130730.1.patch, 
> yarn-sunny-nodemanager-sunny-Inspiron.log
>
>
> I have tried running DistributedShell and also used its ApplicationMaster for 
> my test.
> The application runs successfully, though it logs some errors which would be 
> useful to fix.
> Below are the logs from the NodeManager and the ApplicationMaster.
> Log Snippet for NodeManager
> =
> 2013-07-07 13:39:18,787 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connecting 
> to ResourceManager at localhost/127.0.0.1:9990. current no. of attempts is 1
> 2013-07-07 13:39:19,050 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
>  Rolling master-key for container-tokens, got key with id -325382586
> 2013-07-07 13:39:19,052 INFO 
> org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: 
> Rolling master-key for nm-tokens, got key with id :1005046570
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered 
> with ResourceManager as sunny-Inspiron:9993 with total resource of 
> 
> 2013-07-07 13:39:19,053 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying 
> ContainerManager to unblock new container-requests
> 2013-07-07 13:39:35,256 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1373184544832_0001_01 (auth:SIMPLE)
> 2013-07-07 13:39:35,492 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1373184544832_0001_01_01 by user sunny
> 2013-07-07 13:39:35,507 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Creating a new application reference for app application_1373184544832_0001
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny  
> IP=127.0.0.1OPERATION=Start Container Request   
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1373184544832_0001
> CONTAINERID=container_1373184544832_0001_01_01
> 2013-07-07 13:39:35,511 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from NEW to INITING
> 2013-07-07 13:39:35,512 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Adding container_1373184544832_0001_01_01 to application 
> application_1373184544832_0001
> 2013-07-07 13:39:35,518 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1373184544832_0001 transitioned from INITING to 
> RUNNING
> 2013-07-07 13:39:35,528 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1373184544832_0001_01_01 transitioned from NEW to 
> LOCALIZING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource hdfs://localhost:9000/application/test.jar transitioned from INIT 
> to DOWNLOADING
> 2013-07-07 13:39:35,540 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1373184544832

[jira] [Updated] (YARN-758) TestRMRestart should use MockNMs with multiple cores

2013-07-31 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-758:
--

Summary: TestRMRestart should use MockNMs with multiple cores  (was: Fair 
scheduler has some bug that causes TestRMRestart to fail)

> TestRMRestart should use MockNMs with multiple cores
> 
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1001) YARN should provide per application-type and state statistics

2013-07-31 Thread Srimanth Gunturi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725751#comment-13725751
 ] 

Srimanth Gunturi commented on YARN-1001:


[~zjshen], yes, we could filter by app-type + state and count. But this is very 
inefficient, as the number of applications could be large, with paging 
potentially involved. We don't want to read the huge outputs just to get the 
counts. It would be helpful if state counts per app-type were provided.

> YARN should provide per application-type and state statistics
> -
>
> Key: YARN-1001
> URL: https://issues.apache.org/jira/browse/YARN-1001
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: api
>Affects Versions: 2.1.0-beta
>Reporter: Srimanth Gunturi
>
> In Ambari we plan to show for MR2 the number of applications finished, 
> running, waiting, etc. It would be efficient if YARN could provide per 
> application-type and state aggregated counts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-808) ApplicationReport does not clearly tell that the attempt is running or not

2013-07-31 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-808:
---

Attachment: YARN-808.1.patch

> ApplicationReport does not clearly tell that the attempt is running or not
> --
>
> Key: YARN-808
> URL: https://issues.apache.org/jira/browse/YARN-808
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-808.1.patch
>
>
> When an app attempt fails and is being retried, ApplicationReport immediately 
> gives the new attemptId and non-null values of host etc. There is no way for 
> clients to know that the attempt is running other than connecting to it and 
> timing out on invalid host. Solution would be to expose the attempt state or 
> return a null value for host instead of "N/A"

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-990) YARN REST api needs filtering capability

2013-07-31 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725748#comment-13725748
 ] 

Zhijie Shen commented on YARN-990:
--

I checked the latest trunk: RMWebServices already provides filters for both 
applicationType*s* and state. Are you looking for more filters, or asking that 
state accept multiple values?

> YARN REST api needs filtering capability
> 
>
> Key: YARN-990
> URL: https://issues.apache.org/jira/browse/YARN-990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 2.1.0-beta
>Reporter: Srimanth Gunturi
>
> We wanted to find the MR2 apps which were running/finished/etc. There was no 
> filtering capability of the /apps endpoint.
> [http://dev01:8088/ws/v1/cluster/apps?applicationType=MAPREDUCE&state=RUNNING]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1001) YARN should provide per application-type and state statistics

2013-07-31 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725740#comment-13725740
 ] 

Zhijie Shen commented on YARN-1001:
---

[~srimanth.gunturi], I'm not fully sure it will meet Ambari's requirement, but 
it is worth mentioning that getApplication() can now return the applications of 
a certain type when the type name is supplied. The count can then be derived 
from the response.
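
For illustration, the client-side counting being described might look roughly 
like this (assumes YarnClient#getApplications and 
ApplicationReport#getApplicationType; note it still pulls the full list, which 
is exactly the inefficiency raised above):

{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class AppCounts {
  // Counts RUNNING applications of one type by filtering the full report
  // list client-side.
  static int countRunning(YarnClient client, String appType) throws Exception {
    int count = 0;
    List<ApplicationReport> apps = client.getApplications();
    for (ApplicationReport report : apps) {
      if (appType.equals(report.getApplicationType())
          && report.getYarnApplicationState() == YarnApplicationState.RUNNING) {
        count++;
      }
    }
    return count;
  }
}
{code}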

> YARN should provide per application-type and state statistics
> -
>
> Key: YARN-1001
> URL: https://issues.apache.org/jira/browse/YARN-1001
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: api
>Affects Versions: 2.1.0-beta
>Reporter: Srimanth Gunturi
>
> In Ambari we plan to show for MR2 the number of applications finished, 
> running, waiting, etc. It would be efficient if YARN could provide per 
> application-type and state aggregated counts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-770) NPE NodeStatusUpdaterImpl

2013-07-31 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725739#comment-13725739
 ] 

Xuan Gong commented on YARN-770:


I ran several tests which use a mini YARN cluster, such as testAMRMClient and 
testNMClient, but I did not see this NPE.

> NPE NodeStatusUpdaterImpl
> -
>
> Key: YARN-770
> URL: https://issues.apache.org/jira/browse/YARN-770
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> A mini yarn cluster based test just failed -NPE in the logs in 
> {{NodeStatusUpdaterImpl}}, which is probably a symptom of the problem, not 
> the cause -network trouble more likely there- but it shows there's some extra 
> checking for null responses.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-76) killApplication doesn't fully kill application master on Mac OS

2013-07-31 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725719#comment-13725719
 ] 

Xuan Gong commented on YARN-76:
---

[~bowang] I ran the sleep task and killed the application. Could you tell me 
which command you are running to find that the application master is not 
killed? I ran ps -A before and after the kill command, and I did not find the 
application master still alive.

> killApplication doesn't fully kill application master on Mac OS
> ---
>
> Key: YARN-76
> URL: https://issues.apache.org/jira/browse/YARN-76
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Failed on MacOS. OK on Linux
>Reporter: Bo Wang
>
> When client sends a ClientRMProtocol#killApplication to RM, the corresponding 
> AM is supposed to be killed. However, on Mac OS, the AM is still alive (w/o 
> any interruption).
> I figured out part of the reason after some debugging. NM starts a AM with 
> command like "/bin/bash -c /path/to/java SampleAM". This command is executed 
> in a process (say with PID 0001), which starts another Java process (say with 
> PID 0002). When NM kills the AM, it send SIGTERM and then SIGKILL to the bash 
> process (PID 0001). In Linux, the death of the bash process (PID 0001) will 
> trigger the kill of the Java process (PID 0002). However, in Mac OS, only the 
> bash process is killed. The Java process is in the wild since then.
> Note: on Mac OS, DefaultContainerExecutor is used rather than 
> LinuxContainerExecutor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1006) Nodes list web page on the RM web UI is broken

2013-07-31 Thread Jian He (JIRA)
Jian He created YARN-1006:
-

 Summary: Nodes list web page on the RM web UI is broken
 Key: YARN-1006
 URL: https://issues.apache.org/jira/browse/YARN-1006
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He


The nodes web page, which lists all the connected nodes of the cluster, is broken.

1. The page is not showing in the correct format/style.
2. If we restart an NM, the node list is not refreshed; the newly started NM is 
just added to the list, and the old NM's information still remains.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-957) Capacity Scheduler tries to reserve the memory more than what node manager reports.

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725693#comment-13725693
 ] 

Hadoop QA commented on YARN-957:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12595262/YARN-957-20130731.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1628//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1628//console

This message is automatically generated.

> Capacity Scheduler tries to reserve the memory more than what node manager 
> reports.
> ---
>
> Key: YARN-957
> URL: https://issues.apache.org/jira/browse/YARN-957
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
> Attachments: YARN-957-20130730.1.patch, YARN-957-20130730.2.patch, 
> YARN-957-20130730.3.patch, YARN-957-20130731.1.patch
>
>
> I have 2 node managers.
> * one with 1024 MB memory.(nm1)
> * second with 2048 MB memory.(nm2)
> I am submitting simple map reduce application with 1 mapper and one reducer 
> with 1024mb each. The steps to reproduce this are
> * stop nm2 with 2048MB memory.( This I am doing to make sure that this node's 
> heartbeat doesn't reach RM first).
> * now submit application. As soon as it receives first node's (nm1) heartbeat 
> it will try to reserve memory for AM-container (2048MB). However it has only 
> 1024MB of memory.
> * now start nm2 with 2048 MB memory.
> It hangs forever... Ideally this has two potential issues.
> * It should not try to reserve memory on a node manager which is never going 
> to give requested memory. i.e. Current max capability of node manager is 
> 1024MB but 2048MB is reserved on it. But it still does that.
> * Say 2048MB is reserved on nm1 but nm2 comes back with 2048MB available 
> memory. In this case if the original request was made without any locality 
> then scheduler should unreserve memory on nm1 and allocate requested 2048MB 
> container on nm2.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-758) Fair scheduler has some bug that causes TestRMRestart to fail

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725683#comment-13725683
 ] 

Hadoop QA commented on YARN-758:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595260/yarn-758-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1627//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1627//console

This message is automatically generated.

> Fair scheduler has some bug that causes TestRMRestart to fail
> -
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-957) Capacity Scheduler tries to reserve the memory more than what node manager reports.

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-957:
---

Attachment: YARN-957-20130731.1.patch

> Capacity Scheduler tries to reserve the memory more than what node manager 
> reports.
> ---
>
> Key: YARN-957
> URL: https://issues.apache.org/jira/browse/YARN-957
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
> Attachments: YARN-957-20130730.1.patch, YARN-957-20130730.2.patch, 
> YARN-957-20130730.3.patch, YARN-957-20130731.1.patch
>
>
> I have 2 node managers.
> * one with 1024 MB memory.(nm1)
> * second with 2048 MB memory.(nm2)
> I am submitting simple map reduce application with 1 mapper and one reducer 
> with 1024mb each. The steps to reproduce this are
> * stop nm2 with 2048MB memory.( This I am doing to make sure that this node's 
> heartbeat doesn't reach RM first).
> * now submit application. As soon as it receives first node's (nm1) heartbeat 
> it will try to reserve memory for AM-container (2048MB). However it has only 
> 1024MB of memory.
> * now start nm2 with 2048 MB memory.
> It hangs forever... Ideally this has two potential issues.
> * It should not try to reserve memory on a node manager which is never going 
> to give requested memory. i.e. Current max capability of node manager is 
> 1024MB but 2048MB is reserved on it. But it still does that.
> * Say 2048MB is reserved on nm1 but nm2 comes back with 2048MB available 
> memory. In this case if the original request was made without any locality 
> then scheduler should unreserve memory on nm1 and allocate requested 2048MB 
> container on nm2.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-758) Fair scheduler has some bug that causes TestRMRestart to fail

2013-07-31 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-758:
--

Attachment: yarn-758-1.patch

Added a constructor to MockNM that takes vcores as well, and updated 
TestRMRestart to use that.

> Fair scheduler has some bug that causes TestRMRestart to fail
> -
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: yarn-758-1.patch
>
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725652#comment-13725652
 ] 

Arun C Murthy commented on YARN-972:


I'll repeat what I said in YARN-2: this is a bad idea, for the same reasons I 
gave there:
# Fractional arithmetic is expensive, particularly in Java; again, see 
MAPREDUCE-1354. Currently the CS can fill up decent-sized clusters in <100ms. I'm 
willing to bet this will be more expensive - I'd like to see benchmarks before 
we argue against it. Also, instead of multiplying by 1000 etc., we can increase 
the number of vcores.
# A vcore being virtual doesn't mean it can't be predictable across clusters - 
I'd support an enhancement which pins a specific value for a vcore a la EC2 
(a 2009 Xeon, or whatever we want to pick).
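
For context on the integer-scaling idea under debate, here is a toy sketch 
(illustrative only, not scheduler code) of how milli-vcore accounting keeps the 
arithmetic in integers:

{code:java}
// Toy illustration of the "multiply by 1000" idea from the description:
// CPU demand and capacity kept as integer milli-vcores, so fitting checks
// stay in integer arithmetic rather than floating point.
public class MilliVcores {
  static final int MILLIS_PER_VCORE = 1000;

  // e.g. 0.25 of a core is expressed as 250 milli-vcores.
  static boolean fits(int requestedMilliVcores, int availableMilliVcores) {
    return requestedMilliVcores <= availableMilliVcores;
  }

  public static void main(String[] args) {
    int nodeCapacity = 8 * MILLIS_PER_VCORE;     // an 8-core node
    System.out.println(fits(250, nodeCapacity)); // true
  }
}
{code}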

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that a 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
> Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-985) Nodemanager should log where a resource was localized

2013-07-31 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725610#comment-13725610
 ] 

Ravi Prakash commented on YARN-985:
---

Hi Omkar!
bq.  yeah this is a small thing but would have preferred logging in successful 
state transition
Is there a reason for your preference? The reason I put it there is that this 
way I am piggybacking on an already existing log statement. Granted, it 
shouldn't be much of a performance bottleneck (unless it's a fat node launching 
a lot of containers which are localizing a lot of files), but there's a 
tangible reason why I chose to do it that way. Even logging has performance 
implications, as we saw in HDFS-4080.

bq. You can also have a debug log when a file gets removed from the cache, to 
see whether it is deleted or not. LocalResourcesTrackerImpl.java
I'm sorry I missed your suggestion in the original post; it's a good one. I'll 
update the patch to call Log.info() when it removes the LocalizedResource.
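
For illustration only, a hedged sketch of the kind of log lines being 
discussed; the class and method names below are hypothetical, not the actual 
patch:

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Hypothetical sketch: piggyback the local destination path onto the existing
// transition log, and log removals from the local cache so deletions can be
// traced later.
public class LocalizationLoggingSketch {
  private static final Log LOG = LogFactory.getLog(LocalizationLoggingSketch.class);

  public void onLocalized(String resource, String localPath) {
    // Record WHERE the resource landed on local disk, for later debugging.
    LOG.info("Resource " + resource + " transitioned to LOCALIZED at " + localPath);
  }

  public void onRemoved(String resource, String localPath) {
    LOG.info("Removed " + localPath + " from local cache for resource " + resource);
  }
}
{code}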

> Nodemanager should log where a resource was localized
> -
>
> Key: YARN-985
> URL: https://issues.apache.org/jira/browse/YARN-985
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0, 2.0.4-alpha, 0.23.9
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-985.patch
>
>
> When a resource is localized, we should log WHERE on the local disk it was 
> localized. This helps in debugging afterwards (e.g. if the disk was to go 
> bad).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725556#comment-13725556
 ] 

Hitesh Shah commented on YARN-1004:
---

[~vinodkv] min allocation is no longer visible to an application. An 
application asks for a certain resource size and will be given either exactly 
that size or something bigger; the application is no longer told how much 
bigger.
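
For context, a minimal sketch of the rounding behaviour being described, under 
the assumption that the scheduler rounds a request up to a multiple of its 
increment and clamps it to the configured minimum and maximum (illustrative 
names, not the actual scheduler code):

{code}
// Illustrative only: why an application gets back "the exact size or bigger".
public final class AllocationNormalizerSketch {
  // Round the ask up to the increment, after applying the minimum, then cap at max.
  public static int normalizeMb(int requestedMb, int minMb, int incrementMb, int maxMb) {
    int atLeastMin = Math.max(requestedMb, minMb);
    int roundedUp = ((atLeastMin + incrementMb - 1) / incrementMb) * incrementMb;
    return Math.min(roundedUp, maxMb);
  }

  public static void main(String[] args) {
    // e.g. min=1024, increment=512, max=8192: a 1200 MB ask comes back as 1536 MB.
    System.out.println(normalizeMb(1200, 1024, 512, 8192));
  }
}
{code}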

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725571#comment-13725571
 ] 

Sangjin Lee commented on YARN-573:
--

+1 on wrapping the list with Collections.synchronizedList(). That would make 
the intent a bit clearer. Then you can drop the synchronization on the add() 
call (you'd still need explicit synchronization for the iteration, as [~jlowe] 
pointed out).

An alternative (which may be slightly more concurrent) is to use 
ConcurrentLinkedQueue. You would drop back from List to Queue, but that's all 
you need anyway, and you would no longer need any explicit synchronization.
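
A minimal sketch of the ConcurrentLinkedQueue alternative, with illustrative 
field and method names rather than the actual localizer code:

{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch: a lock-free pending queue shared between the
// event-handling thread (addResource) and the localizer thread.
public class PendingQueueSketch<T> {
  private final Queue<T> pending = new ConcurrentLinkedQueue<T>();

  public void addResource(T resource) {
    pending.offer(resource);   // safe without any external locking
  }

  public T findNextResource() {
    return pending.poll();     // atomically removes the head, or returns null
  }
}
{code}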

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725544#comment-13725544
 ] 

Jason Lowe commented on YARN-573:
-

bq. I thought about it earlier, but we are using an iterator internally and 
modifying the list through that iterator, which won't be thread safe. Let me 
know whether we should use Collections.synchronizedList or synchronize on the 
list?

It's OK to iterate over a SynchronizedList as long as one explicitly 
synchronizes on the list while iterating.  This is called out in the javadocs 
for Collections.synchronizedList.  Synchronizing on the list will effectively 
block all other threads attempting to access the list until the iteration 
completes, because the SynchronizedList methods end up using {{this}} as a 
mutex.
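
A small sketch of that pattern, assuming Collections.synchronizedList and 
illustrative names only:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: individual calls are thread-safe on their own, but
// iteration must hold the list's monitor, as the javadoc requires.
public class SynchronizedListIterationSketch {
  private final List<String> pending =
      Collections.synchronizedList(new ArrayList<String>());

  public void addResource(String resource) {
    pending.add(resource);          // no extra synchronization needed
  }

  public String findNextResource() {
    synchronized (pending) {        // blocks other accesses while iterating
      for (Iterator<String> i = pending.iterator(); i.hasNext();) {
        String candidate = i.next();
        if (isReady(candidate)) {
          i.remove();               // safe: no concurrent mutation under the lock
          return candidate;
        }
      }
    }
    return null;
  }

  private boolean isReady(String candidate) {
    return true;                    // placeholder predicate for the sketch
  }
}
{code}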

bq. Yes, you are right; we should change the constructor to use ConcurrentMap. 
I will fix it together with the above question/comment.

I was not thinking the constructor should take a ConcurrentMap so much as 
thinking that particular constructor should simply be removed.  It's not 
called by anything other than the simpler constructor form, and we can just 
have that constructor create the ConcurrentMap directly when it initializes 
the {{pending}} field.

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725530#comment-13725530
 ] 

Vinod Kumar Vavilapalli commented on YARN-1004:
---

bq.  maximum-allocation is just for consistency. My thought is that it should 
be scheduler-specific because it's up to the scheduler to honor the config. 
Someone could write a new scheduler and not handle it.
I haven't been following the FifoScheduler changes, but this is wrong. All 
schedulers should honor this. Otherwise app-writers won't know what can be 
honored and what cannot.

It seems it is already agreed that min is scheduler-specific. Even there I'd 
make the same argument.

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725528#comment-13725528
 ] 

Allen Wittenauer commented on YARN-972:
---

bq.  Nodes in probably the majority of clusters are configured with more slots 
than cores. This is sensible because many types of task do a lot of IO and do 
not even saturate half of a single core. 

I disagree. It isn't a sensible thing to do at all unless it *also* schedules 
based upon IO characteristics in addition to processor needs.  The system 
eventually ends up in a death spiral:

P1: "We need more processes on this machine because the load isn't high!"

P2: "OK!  I've put more of our IO intensive processes on this machine!"

P1: "Weird!  The CPUs are now spending more time in IO wait!  Let's add more 
processes since we have more CPU to get it higher!"

...

I posit that the reason why (at least in Hadoop 1.x systems) there are more 
tasks than cores is simple: the jobs are crap.  They are spending more time 
launching JVMs and getting scheduled than they are actually executing code. It 
gives the illusion that Hadoop isn't scheduling efficiently. Unless one 
recognizes that there is a tipping point in parallelism, most users are going 
to keep increasing it in blind faith that "more tasks = faster always". 


Also, yes, I want YARN-796, but I don't think that's an orthogonal discussion.  
My opinion is that they are different facets of the same discussion: how do we 
properly schedule in a mixed load environment.  It's very hard to get it 100% 
efficient for all cases.  Some folks are going to have to suffer.  If I had to 
pick, let it be the folks with workloads that are either terribly written or 
sleep a lot and don't require a lot of processor when they do wake up.

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
>  Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725522#comment-13725522
 ] 

Sandy Ryza commented on YARN-1004:
--

[~hitesh], your reasoning makes sense to me.  I'll leave the maximum and just 
update the minimum and increment.

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-966) The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

2013-07-31 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725526#comment-13725526
 ] 

Zhijie Shen commented on YARN-966:
--

bq. I don't see when we would in fact start ContainerLaunch#call without all 
of its resources having been downloaded.

YARN-906 is such a corner case.

bq. I still think this should not be done via a NULL check. The proper way is 
to set a boolean flag on ContainerLaunch synchronously in the event of a KILL.

The original code checks state == LOCALIZED and throws an AssertError when 
getting the localized resources. I just modified the way the error is 
indicated, so that its callers can handle it more easily. If you think calling 
getLocalizedResources() when the container is not at LOCALIZED is not wrong, 
I'm afraid we're having a different conversation.

bq. which is completely misleading. Indeed, this occurred because the user 
killed the container, not because it failed to localize resources.

I don't think the message is misleading. Again, getLocalizedResources() is not 
allowed to be called when the container is not at LOCALIZED (at least that is 
what the original code intends). So the message clearly states the problem. 
Please note that the kill signal is not the root cause of the thread failure 
here. If getLocalizedResources() were not called, the thread would still 
complete without exception.
 

> The thread of ContainerLaunch#call will fail without any signal if 
> getLocalizedResources() is called when the container is not at LOCALIZED
> ---
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), 
> which is scheduled on a separate thread. If the container is not at LOCALIZED 
> (e.g. it is at KILLING, see YARN-906), an AssertError will be thrown and 
> will fail the thread without notifying the NM. Therefore, the container cannot 
> receive more events, which are supposed to be sent from 
> ContainerLaunch.call(), and move towards completion. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-966) The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725518#comment-13725518
 ] 

Vinod Kumar Vavilapalli commented on YARN-966:
--

bq. I don't see when we would in fact start ContainerLaunch#call without all 
of its resources having been downloaded.
This is the most important point.

bq. I still think this should not be done via a NULL check. The proper way is 
to set a boolean flag on ContainerLaunch synchronously in the event of a KILL.
bq. which is completely misleading. Indeed, this occurred because the user 
killed the container, not because it failed to localize resources.
I think we are beating this to death. Like I said, this error SHOULD NOT 
happen in practice. I don't know why the assert was originally put in place. 
That said, I didn't want to blindly remove it without knowing why it was there 
to begin with. If we ever run into this in real life, we can fix the message.
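
For reference, a hedged sketch of the difference being discussed: a bare 
assert either does nothing (assertions disabled, the default) or throws an 
AssertionError that silently kills the calling thread (-ea), whereas an 
explicit check lets the caller react. The class shape below is illustrative, 
not the committed change:

{code}
import java.util.Collections;
import java.util.Map;

// Illustrative sketch only; not ContainerImpl itself.
public class LocalizedResourcesSketch {
  enum ContainerState { NEW, LOCALIZING, LOCALIZED, KILLING }

  private ContainerState state = ContainerState.KILLING;
  private final Map<String, String> localizedResources = Collections.emptyMap();

  // Original style: behaviour depends entirely on the JVM's assertion setting.
  public Map<String, String> getLocalizedResourcesWithAssert() {
    assert state == ContainerState.LOCALIZED; // TODO: FIXME!!
    return localizedResources;
  }

  // Alternative discussed here: signal the caller explicitly (null return)
  // so it can surface a proper error instead of the thread dying unnoticed.
  public Map<String, String> getLocalizedResourcesOrNull() {
    if (state != ContainerState.LOCALIZED) {
      return null;
    }
    return localizedResources;
  }
}
{code}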

> The thread of ContainerLaunch#call will fail without any signal if 
> getLocalizedResources() is called when the container is not at LOCALIZED
> ---
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), 
> which is scheduled on a separate thread. If the container is not at LOCALIZED 
> (e.g. it is at KILLING, see YARN-906), an AssertError will be thrown and 
> will fail the thread without notifying the NM. Therefore, the container cannot 
> receive more events, which are supposed to be sent from 
> ContainerLaunch.call(), and move towards completion. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725517#comment-13725517
 ] 

Hitesh Shah commented on YARN-1004:
---

[~sandyr] Scheduler-specific configs are fine as long as they don't affect the 
APIs and how an app needs to be written.

The reason I mentioned max is that max is currently exposed in the API, and 
therefore it needs either to be an RM-level config or an enforced config 
property of each scheduler implementation.

Making it a scheduler-specific implementation choice whether or not to honor 
max seems wrong. Based on the current API, it is a defined contract between an 
app and YARN that a container greater than max will not be allocated. Having 
one scheduler enforce that contract and another not enforce it means that 
applications now need to know which scheduler is running and change their 
code/run-time flow accordingly. That is a huge problem for developers trying 
to write applications on YARN.
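
For context, a hedged sketch of how an application currently sees the max 
through the registration response, assuming the AMRMClient API of this era; 
the host, port, and tracking URL values are placeholders:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Sketch: the max allocation is part of the AM-facing contract, handed back
// at registration time regardless of which scheduler the RM is running.
public class MaxAllocationSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<AMRMClient.ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
    amrmClient.init(new Configuration());
    amrmClient.start();
    try {
      RegisterApplicationMasterResponse response =
          amrmClient.registerApplicationMaster("am-host", 0, "");
      Resource max = response.getMaximumResourceCapability();
      System.out.println("Largest container the RM will allocate: " + max);
    } finally {
      amrmClient.stop();
    }
  }
}
{code}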



 

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-573:
---

Description: 
PublicLocalizer
1) pending accessed by addResource (part of event handling) and run method (as 
a part of PublicLocalizer.run() ).

PrivateLocalizer
1) pending accessed by addResource (part of event handling) and 
findNextResource (i.remove()). Also update method should be fixed. It too is 
sharing pending list.


  was:
PublicLocalizer
1) pending accessed by addResource (part of event handling) and run method (as 
a part of PublicLocalizer.run() ).

PrivateLocalizer
1) pending accessed by addResource (part of event handling) and 
findNextResource (i.remove()).



> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-955) [YARN-321] History Service should create the RPC server and wire it to HistoryStorage

2013-07-31 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725506#comment-13725506
 ] 

Mayank Bansal commented on YARN-955:


Taking it over

> [YARN-321] History Service should create the RPC server and wire it to 
> HistoryStorage
> -
>
> Key: YARN-955
> URL: https://issues.apache.org/jira/browse/YARN-955
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-955) [YARN-321] History Service should create the RPC server and wire it to HistoryStorage

2013-07-31 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal reassigned YARN-955:
--

Assignee: Mayank Bansal  (was: Vinod Kumar Vavilapalli)

> [YARN-321] History Service should create the RPC server and wire it to 
> HistoryStorage
> -
>
> Key: YARN-955
> URL: https://issues.apache.org/jira/browse/YARN-955
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-987) Implementation of *HistoryData classes to convert to *Report Objects

2013-07-31 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-987:
---

Summary: Implementation of *HistoryData classes to convert to *Report 
Objects  (was: Read Interface Implementation of HistoryStorage for AHS)

> Implementation of *HistoryData classes to convert to *Report Objects
> 
>
> Key: YARN-987
> URL: https://issues.apache.org/jira/browse/YARN-987
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725505#comment-13725505
 ] 

Omkar Vinit Joshi commented on YARN-573:


[~jlowe] Thanks for reviewing..
bq. LocalizerRunner.pending is accessed without synchronization in the update() 
method. Maybe it would be simpler to just use a SynchronizedList wrapper? That 
would make it a bit more robust in light of maintenance changes in the future 
as well.
Yeah, my bad; I missed the update call. That should be fixed. Regarding using 
a synchronized list: I thought about it earlier, but we are using an iterator 
internally and modifying the list through that iterator, which won't be thread 
safe. Let me know whether we should use Collections.synchronizedList or 
synchronize on the list? Correct me if I am wrong anywhere.

bq. Nit: The PublicLocalizer constructor that takes a Map isn't really used, 
and as we know pending can't be just any Map for it to work properly. I'd be 
tempted to remove that constructor, but it's not a necessary change.
Yes, you are right; we should change the constructor to use ConcurrentMap. I 
will fix it together with the above question/comment.

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-966) The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725489#comment-13725489
 ] 

Omkar Vinit Joshi commented on YARN-966:


So if the user, say, kills the container, the error we will see is
{code}
+RPCUtil.getRemoteException(
+"Unable to get local resources when Container " + containerID +
+" is at " + container.getContainerState());
{code}
which is completely misleading. Indeed, this occurred because the user killed 
the container, not because it failed to localize resources.

> The thread of ContainerLaunch#call will fail without any signal if 
> getLocalizedResources() is called when the container is not at LOCALIZED
> ---
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), 
> which is scheduled on a separate thread. If the container is not at LOCALIZED 
> (e.g. it is at KILLING, see YARN-906), an AssertError will be thrown and 
> will fail the thread without notifying the NM. Therefore, the container cannot 
> receive more events, which are supposed to be sent from 
> ContainerLaunch.call(), and move towards completion. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-966) The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

2013-07-31 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725483#comment-13725483
 ] 

Omkar Vinit Joshi commented on YARN-966:


bq. One more consideration. An empty map can mean the case that the container 
is at LOCALIZED but there are actually no localized resources. Returning null 
is to distinguish this case from the case of fetching the localized resources 
when the container is not at LOCALIZED.
This assumption is wrong. The state of the container has nothing to do with 
the localized resources map; we can call getState and know its state 
irrespective of this null check. I don't see when we would in fact start 
ContainerLaunch#call without all of its resources having been downloaded. It 
is a different issue that the user may kill the container, resulting in a 
state transition. I still think this should not be done via a NULL check. The 
proper way is to set a boolean flag on ContainerLaunch synchronously in the 
event of a KILL.
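
A minimal sketch of that alternative; the class shape and names are 
illustrative, not the actual ContainerLaunch code:

{code}
import java.util.concurrent.Callable;

// Illustrative sketch: record the KILL synchronously via a flag so call()
// can bail out cleanly instead of discovering it through a null return.
public class ContainerLaunchSketch implements Callable<Integer> {
  private volatile boolean killedBeforeLaunch = false;

  // Invoked synchronously from the KILL event handling path.
  public void markKilled() {
    killedBeforeLaunch = true;
  }

  @Override
  public Integer call() {
    if (killedBeforeLaunch) {
      // Skip the launch entirely; no misleading localization error is raised.
      return -1;
    }
    // ... otherwise fetch the localized resources and launch the container ...
    return 0;
  }
}
{code}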

> The thread of ContainerLaunch#call will fail without any signal if 
> getLocalizedResources() is called when the container is not at LOCALIZED
> ---
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), 
> which is scheduled on a separate thread. If the container is not at LOCALIZED 
> (e.g. it is at KILLING, see YARN-906), an AssertError will be thrown and 
> will fail the thread without notifying the NM. Therefore, the container cannot 
> receive more events, which are supposed to be sent from 
> ContainerLaunch.call(), and move towards completion. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725476#comment-13725476
 ] 

Sandy Ryza commented on YARN-1004:
--

[~hitesh], maximum-allocation is just for consistency.  My thought is that it 
should be scheduler-specific because it's up to the scheduler to honor the 
config.  Someone could write a new scheduler and not handle it.  We have other 
configs, such as node-locality-threshold, that function the same for the Fair 
and Capacity schedulers as well.

[~bikassaha], I think this is important: in my experience, having these 
properties function differently for different schedulers has made explaining 
resource configuration really difficult.  I didn't want to delay the release, 
but I'll upload a patch today without deprecations and we can decide where to 
go from there.
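
For reference, a small sketch of how the old key could be mapped onto the 
proposed scheduler-specific keys if deprecation were kept, assuming Hadoop's 
Configuration deprecation mechanism; the new key names follow the proposal in 
this JIRA and do not exist yet:

{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: map the existing key onto the proposed per-scheduler
// keys so old yarn-site.xml files keep working during a transition.
public class SchedulerConfigDeprecationSketch {
  public static void main(String[] args) {
    Configuration.addDeprecation("yarn.scheduler.minimum-allocation-mb",
        new String[] {"yarn.scheduler.fair.minimum-allocation-mb",
                      "yarn.scheduler.capacity.minimum-allocation-mb",
                      "yarn.scheduler.fifo.minimum-allocation-mb"});

    Configuration conf = new Configuration();
    conf.set("yarn.scheduler.minimum-allocation-mb", "1024");
    // The deprecated key resolves to the new names it was mapped to.
    System.out.println(conf.get("yarn.scheduler.fair.minimum-allocation-mb"));
  }
}
{code}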

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-602) NodeManager should mandatorily set some Environment variables into every containers that it launches

2013-07-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725475#comment-13725475
 ] 

Vinod Kumar Vavilapalli commented on YARN-602:
--

bq. How can I fix putting the env in the Windows case? Is it relevant to 
Environment.USER (USERNAME in Windows)?
That's correct. Environment.USER automatically resolves correctly depending on 
the OS.

> NodeManager should mandatorily set some Environment variables into every 
> containers that it launches
> 
>
> Key: YARN-602
> URL: https://issues.apache.org/jira/browse/YARN-602
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Kenji Kikushima
> Attachments: YARN-602.patch
>
>
> NodeManager should mandatorily set some Environment variables into every 
> container that it launches, such as Environment.user and Environment.pwd. If 
> both the user and the NodeManager set those variables, the value set by the 
> NM should be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725459#comment-13725459
 ] 

Sandy Ryza commented on YARN-972:
-

[~ste...@apache.org],
bq. Those caches are the key to performance, and if you are trying to 
overburden the cores with work then its the cache miss penalty that kills the 
jobs.
This is routinely done on clusters already.  Nodes in probably the majority of 
clusters are configured with more slots than cores.  This is sensible because 
many types of task do a lot of IO and do not even saturate half of a single 
core. 

bq. Optimising for today's 4-8 cores is a premature optimisation.
In what way are we optimising for 4-8 cores?

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
>  Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-107) ClientRMService.forceKillApplication() should handle the non-RUNNING applications properly

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725429#comment-13725429
 ] 

Hadoop QA commented on YARN-107:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595207/YARN-107.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1625//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1625//console

This message is automatically generated.

> ClientRMService.forceKillApplication() should handle the non-RUNNING 
> applications properly
> --
>
> Key: YARN-107
> URL: https://issues.apache.org/jira/browse/YARN-107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Devaraj K
>Assignee: Xuan Gong
> Attachments: YARN-107.1.patch, YARN-107.2.patch, YARN-107.3.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725427#comment-13725427
 ] 

Alejandro Abdelnur commented on YARN-972:
-

Allen, what you want seems to be YARN-796.

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
>  Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-643) WHY appToken is removed both in BaseFinalTransition and AMUnregisteredTransition AND clientToken is removed in FinalTransition and not BaseFinalTransition

2013-07-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725423#comment-13725423
 ] 

Hadoop QA commented on YARN-643:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595205/YARN-643.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1626//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1626//console

This message is automatically generated.

> WHY appToken is removed both in BaseFinalTransition and 
> AMUnregisteredTransition AND clientToken is removed in FinalTransition and 
> not BaseFinalTransition
> --
>
> Key: YARN-643
> URL: https://issues.apache.org/jira/browse/YARN-643
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Xuan Gong
> Attachments: YARN-643.1.patch, YARN-643.2.patch
>
>
> This JIRA tracks why appToken and clientToAMToken are removed separately, 
> and why they are distributed in different transitions; ideally there may be a 
> common place where these two tokens can be removed at the same time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725426#comment-13725426
 ] 

Allen Wittenauer commented on YARN-972:
---

One other thing, before I forget... these requests should also be able to be 
tied to queues.  I.e., I should be able to set up a queue that only has access 
to 4GHz processors and only allows workloads that require them.  Otherwise, 
making this a free-for-all for users is going to turn into a "who can request 
the fastest machines first" death match.  By tying this functionality to 
queues, the ops team has the ability to control who gets the best gear as the 
business requires.

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
>  Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725413#comment-13725413
 ] 

Allen Wittenauer commented on YARN-972:
---

Do we have an example of a workload that needs 'fractional cores'?  Is this 
workload even appropriate for Hadoop?

As one of the probably crazy people who support systems that do extremely 
non-MR things at large scales, I'd prefer to see two things implemented:
* I need a processor this fast (in GHz)
* I need a processor that supports this instruction set

But I'd position the GHz question differently than what has been proposed 
above.  If I say my workload needs a 1GHz processor but there is only a 4GHz 
processor available, then the workload would get the whole 4GHz processor. If 
another workload comes in that needs a 4GHz processor but only a 2GHz processor 
is available, it needs to wait.  

Treating speed as fractions gets into another problem:  2x2GHz != 4GHz.  Just 
as having 1/4 of 4 different cores != 1 core. Throw cpu sets into the mix and 
we've got a major hairball.

Also, I'm a bit leery of our usage of the term core here and elsewhere in 
Hadoop-land. As [~ste...@apache.org] points out, there are impacts on the Lx 
caches when sharing load.  This is also true when talking about most SMT 
implementations, such as Intel's HyperThreading.  This means if we're talking 
about the Linux (and most other OSes) representation of CPU threads as being 
equivalent cores, there is *already* a performance hit and users are *already* 
getting fractional performance just by treating those as "real" cores.

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
>  Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725395#comment-13725395
 ] 

Bikas Saha commented on YARN-1004:
--

It would be sad to have deprecated configs when YARN is still trying to go 
beta. If this is really important, let's fix it and include it in the beta RC.

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-107) ClientRMService.forceKillApplication() should handle the non-RUNNING applications properly

2013-07-31 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-107:
---

Attachment: YARN-107.3.patch

Fix -1 release audit.

> ClientRMService.forceKillApplication() should handle the non-RUNNING 
> applications properly
> --
>
> Key: YARN-107
> URL: https://issues.apache.org/jira/browse/YARN-107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Devaraj K
>Assignee: Xuan Gong
> Attachments: YARN-107.1.patch, YARN-107.2.patch, YARN-107.3.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-643) WHY appToken is removed both in BaseFinalTransition and AMUnregisteredTransition AND clientToken is removed in FinalTransition and not BaseFinalTransition

2013-07-31 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-643:
---

Attachment: YARN-643.2.patch

> WHY appToken is removed both in BaseFinalTransition and 
> AMUnregisteredTransition AND clientToken is removed in FinalTransition and 
> not BaseFinalTransition
> --
>
> Key: YARN-643
> URL: https://issues.apache.org/jira/browse/YARN-643
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Xuan Gong
> Attachments: YARN-643.1.patch, YARN-643.2.patch
>
>
> The jira is tracking why appToken and clientToAMToken is removed separately, 
> and why they are distributed in different transitions, ideally there may be a 
> common place where these two tokens can be removed at the same time. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

2013-07-31 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725372#comment-13725372
 ] 

Steve Loughran commented on YARN-972:
-

I'd argue against fractional core assignment not on CPU or FPU grounds but on 
L1, L2 and L3 cache hit rates. Those caches are the key to performance, and if 
you are trying to overburden the cores with work then it's the cache miss 
penalty that kills the jobs.

MR dodges this by having most tasks regularly block for IO operations, but 
other workloads have different characteristics.

Also, as Timothy points out, no one is going to be releasing CPUs with fewer 
cores on them: the number will only increase. Optimising for today's 4-8 cores 
is a premature optimisation.

> Allow requests and scheduling for fractional virtual cores
> --
>
> Key: YARN-972
> URL: https://issues.apache.org/jira/browse/YARN-972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that a 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
>  Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-997) JMX support for node resource configuration

2013-07-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725310#comment-13725310
 ] 

Junping Du commented on YARN-997:
-

I think Luke has already addressed your previous question on HADOOP-9160, so I 
put HADOOP-9160 as a blocker for this jira. That way we can discuss any JMX 
concerns there before we move forward on this jira.

> JMX support for node resource configuration
> ---
>
> Key: YARN-997
> URL: https://issues.apache.org/jira/browse/YARN-997
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, scheduler
>Reporter: Junping Du
>
> Beside YARN CLI and REST API, we can enable JMX interface to change node's 
> resource.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725299#comment-13725299
 ] 

Hitesh Shah commented on YARN-1004:
---

Is there a reason why maximum-allocation is also scheduler specific?

> yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler
> 
>
> Key: YARN-1004
> URL: https://issues.apache.org/jira/browse/YARN-1004
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>
> As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
> configuration, and functions differently for the Fair and Capacity 
> schedulers, it would be less confusing for the config names to include the 
> scheduler names, i.e. yarn.scheduler.fair.minimum-allocation-mb, 
> yarn.scheduler.capacity.minimum-allocation-mb, and 
> yarn.scheduler.fifo.minimum-allocation-mb.
> The same goes for yarn.scheduler.increment-allocation-mb, which only exists 
> for the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for 
> consistency.
> If we wish to preserve backwards compatibility, we can deprecate the old 
> configs to the new ones. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-602) NodeManager should mandatorily set some Environment variables into every containers that it launches

2013-07-31 Thread Kenji Kikushima (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725296#comment-13725296
 ] 

Kenji Kikushima commented on YARN-602:
--

Sorry for the late update and the Windows case error. I only have Linux and 
OS X environments.
How can I fix setting the env in the Windows case? Is it relevant to 
Environment.USER (USERNAME in Windows)?
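
For illustration, one possible way to handle the Windows case asked about 
above: read the platform-appropriate system variable and put it into the 
container environment under Environment.USER. Shell.WINDOWS and the 
Environment enum are real Hadoop classes; the surrounding method and map are 
assumptions, not the actual NodeManager code.

{code}
import java.util.Map;
import org.apache.hadoop.util.Shell;
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment;

final class ContainerEnvSketch {
  static void setUserEnv(Map<String, String> containerEnv) {
    // On Windows the login name lives in %USERNAME%, on Linux/OS X in $USER.
    String sysVar = Shell.WINDOWS ? "USERNAME" : "USER";
    String user = System.getenv(sysVar);
    if (user != null) {
      // The NM-mandated value wins over anything the user supplied.
      containerEnv.put(Environment.USER.name(), user);
    }
  }
}
{code}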

> NodeManager should mandatorily set some Environment variables into every 
> containers that it launches
> 
>
> Key: YARN-602
> URL: https://issues.apache.org/jira/browse/YARN-602
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Kenji Kikushima
> Attachments: YARN-602.patch
>
>
> NodeManager should mandatorily set some Environment variables into every 
> containers that it launches, such as Environment.user, Environment.pwd. If 
> both users and NodeManager set those variables, the value set by NM should be 
> used 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-948) RM should validate the release container list before actually releasing them

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725278#comment-13725278
 ] 

Hudson commented on YARN-948:
-

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1504 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1504/])
YARN-948. Changed ResourceManager to validate the release container list before 
actually releasing them. Contributed by Omkar Vinit Joshi. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508609)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/InvalidContainerReleaseException.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationmasterservice/TestApplicationMasterService.java


> RM should validate the release container list before actually releasing them
> 
>
> Key: YARN-948
> URL: https://issues.apache.org/jira/browse/YARN-948
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
> Fix For: 2.1.1-beta
>
> Attachments: YARN-948-20130724.patch, YARN-948-20130726.1.patch, 
> YARN-948-20130729.1.patch
>
>
> At present we are blindly passing the allocate request containing containers 
> to be released to the scheduler. This may result in one application 
> releasing another application's container.
> {code}
>   @Override
>   @Lock(Lock.NoLock.class)
>   public Allocation allocate(ApplicationAttemptId applicationAttemptId,
>   List<ResourceRequest> ask, List<ContainerId> release, 
>   List<String> blacklistAdditions, List<String> blacklistRemovals) {
> FiCaSchedulerApp application = getApplication(applicationAttemptId);
> 
> 
> // Release containers
> for (ContainerId releasedContainerId : release) {
>   RMContainer rmContainer = getRMContainer(releasedContainerId);
>   if (rmContainer == null) {
>  RMAuditLogger.logFailure(application.getUser(),
>  AuditConstants.RELEASE_CONTAINER, 
>  "Unauthorized access or invalid container", "CapacityScheduler",
>  "Trying to release container not owned by app or with invalid 
> id",
>  application.getApplicationId(), releasedContainerId);
>   }
>   completedContainer(rmContainer,
>   SchedulerUtils.createAbnormalContainerStatus(
>   releasedContainerId, 
>   SchedulerUtils.RELEASED_CONTAINER),
>   RMContainerEventType.RELEASED);
> }
> {code}
> Current checks are not sufficient and we should prevent this. thoughts?
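
For illustration, a minimal sketch of the kind of ownership check described 
here (the real change lives in RMServerUtils per the commit message above; 
this helper's shape and the exception's constructor are assumptions, not the 
actual code):

{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.exceptions.InvalidContainerReleaseException;

final class ReleaseValidationSketch {
  static void validateReleaseList(ApplicationAttemptId attemptId,
      List<ContainerId> release) throws InvalidContainerReleaseException {
    for (ContainerId cid : release) {
      // A ContainerId embeds the attempt it was allocated to, so a request
      // naming someone else's container can be rejected before the scheduler
      // ever sees it.
      if (!attemptId.equals(cid.getApplicationAttemptId())) {
        throw new InvalidContainerReleaseException(
            "Cannot release container " + cid
                + " not owned by attempt " + attemptId);
      }
    }
  }
}
{code}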

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-966) The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725275#comment-13725275
 ] 

Hudson commented on YARN-966:
-

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1504 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1504/])
YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized 
resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java


> The thread of ContainerLaunch#call will fail without any signal if 
> getLocalizedResources() is called when the container is not at LOCALIZED
> ---
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), 
> which is scheduled on a separate thread. If the container is not at LOCALIZED 
> (e.g. it is at KILLING, see YARN-906), an AssertionError will be thrown, 
> failing the thread without notifying the NM. Therefore, the container cannot 
> receive more events, which are supposed to be sent from 
> ContainerLaunch.call(), and move towards completion. 
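
A small standalone illustration (not YARN code) of why the failure is silent: 
a Throwable thrown inside a Callable submitted to an ExecutorService never 
reaches an uncaught-exception handler; it only surfaces if someone calls 
Future.get(), which, per the report above, does not happen here.

{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SilentAssertDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<Void> f = pool.submit(new Callable<Void>() {
      public Void call() {
        // Stands in for the assert in getLocalizedResources(); run with -ea.
        assert false : "container not LOCALIZED";
        return null;
      }
    });
    // Nothing is logged and no handler fires when the assert trips; the error
    // only becomes visible through get().
    try {
      f.get();
    } catch (ExecutionException e) {
      System.out.println("Only visible here: " + e.getCause());
    }
    pool.shutdown();
  }
}
{code}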

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-07-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725272#comment-13725272
 ] 

Jason Lowe commented on YARN-573:
-

Thanks for picking this up, Omkar.  A couple of comments:

* LocalizerRunner.pending is accessed without synchronization in the update() 
method.  Maybe it would be simpler to just use a SynchronizedList wrapper?  
That would make it a bit more robust in light of maintenance changes in the 
future as well.
* Nit: The PublicLocalizer constructor that takes a Map isn't really used, and 
as we know {{pending}} can't be just any Map for it to work properly.  I'd be 
tempted to remove that constructor, but it's not a necessary change.
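
A rough sketch of the synchronized-list suggestion (field and method names are 
illustrative, not the actual NodeManager code):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PublicLocalizerSketch {
  // Wrapping the list makes individual add/remove calls thread safe, so
  // addResource() (dispatcher thread) and update()/run() (localizer thread)
  // no longer race on a raw ArrayList.
  private final List<String> pending =
      Collections.synchronizedList(new ArrayList<String>());

  void addResource(String resource) {
    pending.add(resource);
  }

  void update() {
    // Iteration still needs an explicit lock on the wrapper, per the
    // Collections.synchronizedList contract.
    synchronized (pending) {
      for (String resource : pending) {
        // ... process resource ...
      }
    }
  }
}
{code}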

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Attachments: YARN-573-20130730.1.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-107) ClientRMService.forceKillApplication() should handle the non-RUNNING applications properly

2013-07-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725237#comment-13725237
 ] 

Jason Lowe commented on YARN-107:
-

I still think throwing an exception for this is a mistake and makes the API 
harder to wield.  What is the use-case where throwing the exception is 
necessary?

> ClientRMService.forceKillApplication() should handle the non-RUNNING 
> applications properly
> --
>
> Key: YARN-107
> URL: https://issues.apache.org/jira/browse/YARN-107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Devaraj K
>Assignee: Xuan Gong
> Attachments: YARN-107.1.patch, YARN-107.2.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-966) The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725211#comment-13725211
 ] 

Hudson commented on YARN-966:
-

FAILURE: Integrated in Hadoop-Hdfs-trunk #1477 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1477/])
YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized 
resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java


> The thread of ContainerLaunch#call will fail without any signal if 
> getLocalizedResources() is called when the container is not at LOCALIZED
> ---
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), 
> which is scheduled on a separate thread. If the container is not at LOCALIZED 
> (e.g. it is at KILLING, see YARN-906), an AssertionError will be thrown, 
> failing the thread without notifying the NM. Therefore, the container cannot 
> receive more events, which are supposed to be sent from 
> ContainerLaunch.call(), and move towards completion. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-948) RM should validate the release container list before actually releasing them

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725215#comment-13725215
 ] 

Hudson commented on YARN-948:
-

FAILURE: Integrated in Hadoop-Hdfs-trunk #1477 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1477/])
YARN-948. Changed ResourceManager to validate the release container list before 
actually releasing them. Contributed by Omkar Vinit Joshi. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508609)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/InvalidContainerReleaseException.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationmasterservice/TestApplicationMasterService.java


> RM should validate the release container list before actually releasing them
> 
>
> Key: YARN-948
> URL: https://issues.apache.org/jira/browse/YARN-948
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
> Fix For: 2.1.1-beta
>
> Attachments: YARN-948-20130724.patch, YARN-948-20130726.1.patch, 
> YARN-948-20130729.1.patch
>
>
> At present we are blindly passing the allocate request containing containers 
> to be released to the scheduler. This may result in one application 
> releasing another application's container.
> {code}
>   @Override
>   @Lock(Lock.NoLock.class)
>   public Allocation allocate(ApplicationAttemptId applicationAttemptId,
>   List<ResourceRequest> ask, List<ContainerId> release, 
>   List<String> blacklistAdditions, List<String> blacklistRemovals) {
> FiCaSchedulerApp application = getApplication(applicationAttemptId);
> 
> 
> // Release containers
> for (ContainerId releasedContainerId : release) {
>   RMContainer rmContainer = getRMContainer(releasedContainerId);
>   if (rmContainer == null) {
>  RMAuditLogger.logFailure(application.getUser(),
>  AuditConstants.RELEASE_CONTAINER, 
>  "Unauthorized access or invalid container", "CapacityScheduler",
>  "Trying to release container not owned by app or with invalid 
> id",
>  application.getApplicationId(), releasedContainerId);
>   }
>   completedContainer(rmContainer,
>   SchedulerUtils.createAbnormalContainerStatus(
>   releasedContainerId, 
>   SchedulerUtils.RELEASED_CONTAINER),
>   RMContainerEventType.RELEASED);
> }
> {code}
> Current checks are not sufficient and we should prevent this. thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1005) Log aggregators should check for FSDataOutputStream close before renaming to aggregated file.

2013-07-31 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-1005:
---

 Summary: Log aggregators should check for FSDataOutputStream close 
before renaming to aggregated file.
 Key: YARN-1005
 URL: https://issues.apache.org/jira/browse/YARN-1005
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Rohith Sharma K S


If AggregatedLogFormat.LogWriter.closeWriter() is interrupted, the 
"remoteNodeTmpLogFileForApp" file is still renamed to the 
"remoteNodeLogFileForApp" file. This renamed file does not contain valid 
aggregated logs; in some situations it is not even in BCFile format.

This causes an error when viewing the logs from the JobHistoryServer web page.

{noformat}
2013-07-27 18:51:14,787 ERROR org.apache.hadoop.yarn.webapp.View: Error getting 
logs for job_1374918614757_0002
java.io.IOException: Not a valid BCFile.
at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:337)
at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:89)
at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:64)
at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:74)
{noformat}
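
A hedged sketch (not the actual NodeManager code) of the check being asked 
for: only promote the tmp aggregated-log file after the writer has been closed 
cleanly. The writer is modelled as a plain Closeable standing in for 
AggregatedLogFormat.LogWriter; FileSystem and Path are the real Hadoop 
classes, while the method shape and names are assumptions.

{code}
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class SafeLogPromotion {
  static void closeAndPromote(Closeable writer, FileSystem remoteFS,
      Path tmpLogFile, Path finalLogFile) throws IOException {
    boolean closedCleanly = false;
    try {
      writer.close();               // flush + close the FSDataOutputStream
      closedCleanly = true;
    } finally {
      if (closedCleanly) {
        // Rename remoteNodeTmpLogFileForApp -> remoteNodeLogFileForApp only
        // now, so a half-written (non-BCFile) file is never published.
        remoteFS.rename(tmpLogFile, finalLogFile);
      }
      // Otherwise leave the tmp file in place rather than exposing a corrupt
      // aggregated log to the JobHistoryServer.
    }
  }
}
{code}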


 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-445) Ability to signal containers

2013-07-31 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725111#comment-13725111
 ] 

Steve Loughran commented on YARN-445:
-

I like Chris's #3 option, as it allows you to add things like a graceful 
shutdown to a piece of code that you don't want to, or can't, change. The 
command would have to run with the same path & other env params as the 
original source if you want to do things like exec an HBase decommission 
command.
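
A rough sketch of what "run a side command with the container's path and env" 
could look like; ProcessBuilder is standard Java, everything else (names, and 
where this would be wired in) is an assumption.

{code}
import java.io.IOException;
import java.util.Map;

final class GracefulCommandSketch {
  static Process runWithContainerEnv(Map<String, String> containerEnv,
      String... command) throws IOException {
    ProcessBuilder pb = new ProcessBuilder(command);
    // Reuse the container's PATH and other env vars so e.g. an HBase
    // decommission script resolves the same binaries as the container did.
    pb.environment().putAll(containerEnv);
    pb.inheritIO();
    return pb.start();
  }
}
{code}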

> Ability to signal containers
> 
>
> Key: YARN-445
> URL: https://issues.apache.org/jira/browse/YARN-445
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>
> It would be nice if an ApplicationMaster could send signals to containers 
> such as SIGQUIT, SIGUSR1, etc.
> For example, in order to replicate the jstack-on-task-timeout feature 
> implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
> interface for sending SIGQUIT to a container.  For that specific feature we 
> could implement it as an additional field in the StopContainerRequest.  
> However that would not address other potential features like the ability for 
> an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
> latter feature would be a very useful debugging tool for users who do not 
> have shell access to the nodes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-770) NPE NodeStatusUpdaterImpl

2013-07-31 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725109#comment-13725109
 ] 

Steve Loughran commented on YARN-770:
-

I was just running a test which used a MiniYARN cluster; it showed this NPE in 
the stack while the network was playing up. I haven't seen it since that 
single incident.

Is there a codepath that could lead to this NPE if some previous operation 
failed with a timeout or an IOException?

> NPE NodeStatusUpdaterImpl
> -
>
> Key: YARN-770
> URL: https://issues.apache.org/jira/browse/YARN-770
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> A mini YARN cluster based test just failed with an NPE in the logs in 
> {{NodeStatusUpdaterImpl}}. The NPE is probably a symptom of the problem, not 
> the cause (network trouble is more likely there), but it shows that some 
> extra checking for null responses is needed.
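
A hedged illustration (not the actual NodeStatusUpdaterImpl code) of the extra 
null-checking suggested above: treat a null or failed heartbeat response as a 
transient failure and retry, instead of dereferencing it. All types and names 
here are stand-ins.

{code}
final class HeartbeatGuardSketch {
  interface Heartbeater {          // stand-in for the RM resource tracker
    Object heartbeat() throws Exception;
  }

  /** Returns true if the response was usable, false if the caller should
      simply retry on the next heartbeat interval. */
  static boolean heartbeatOnce(Heartbeater tracker) {
    Object response;
    try {
      response = tracker.heartbeat();
    } catch (Exception e) {
      // Timeouts / IOExceptions (the suspected network trouble) land here.
      return false;
    }
    if (response == null) {
      // The condition that produced the NPE: guard it instead of crashing.
      return false;
    }
    // ... process response ...
    return true;
  }
}
{code}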

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-966) The thread of ContainerLaunch#call will fail without any signal if getLocalizedResources() is called when the container is not at LOCALIZED

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725099#comment-13725099
 ] 

Hudson commented on YARN-966:
-

SUCCESS: Integrated in Hadoop-Yarn-trunk #287 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/287/])
YARN-966. Fixed ContainerLaunch to not fail quietly when there are no localized 
resources due to some other failure. Contributed by Zhijie Shen. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508688)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java


> The thread of ContainerLaunch#call will fail without any signal if 
> getLocalizedResources() is called when the container is not at LOCALIZED
> ---
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), 
> which is scheduled on a separate thread. If the container is not at LOCALIZED 
> (e.g. it is at KILLING, see YARN-906), an AssertionError will be thrown, 
> failing the thread without notifying the NM. Therefore, the container cannot 
> receive more events, which are supposed to be sent from 
> ContainerLaunch.call(), and move towards completion. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-948) RM should validate the release container list before actually releasing them

2013-07-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725103#comment-13725103
 ] 

Hudson commented on YARN-948:
-

SUCCESS: Integrated in Hadoop-Yarn-trunk #287 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/287/])
YARN-948. Changed ResourceManager to validate the release container list before 
actually releasing them. Contributed by Omkar Vinit Joshi. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1508609)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/InvalidContainerReleaseException.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationmasterservice/TestApplicationMasterService.java


> RM should validate the release container list before actually releasing them
> 
>
> Key: YARN-948
> URL: https://issues.apache.org/jira/browse/YARN-948
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
> Fix For: 2.1.1-beta
>
> Attachments: YARN-948-20130724.patch, YARN-948-20130726.1.patch, 
> YARN-948-20130729.1.patch
>
>
> At present we are blindly passing the allocate request containing containers 
> to be released to the scheduler. This may result in one application 
> releasing another application's container.
> {code}
>   @Override
>   @Lock(Lock.NoLock.class)
>   public Allocation allocate(ApplicationAttemptId applicationAttemptId,
>   List<ResourceRequest> ask, List<ContainerId> release, 
>   List<String> blacklistAdditions, List<String> blacklistRemovals) {
> FiCaSchedulerApp application = getApplication(applicationAttemptId);
> 
> 
> // Release containers
> for (ContainerId releasedContainerId : release) {
>   RMContainer rmContainer = getRMContainer(releasedContainerId);
>   if (rmContainer == null) {
>  RMAuditLogger.logFailure(application.getUser(),
>  AuditConstants.RELEASE_CONTAINER, 
>  "Unauthorized access or invalid container", "CapacityScheduler",
>  "Trying to release container not owned by app or with invalid 
> id",
>  application.getApplicationId(), releasedContainerId);
>   }
>   completedContainer(rmContainer,
>   SchedulerUtils.createAbnormalContainerStatus(
>   releasedContainerId, 
>   SchedulerUtils.RELEASED_CONTAINER),
>   RMContainerEventType.RELEASED);
> }
> {code}
> Current checks are not sufficient and we should prevent this. thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1004) yarn.scheduler.minimum|maximum|increment-allocation-mb should have scheduler

2013-07-31 Thread Sandy Ryza (JIRA)
Sandy Ryza created YARN-1004:


 Summary: yarn.scheduler.minimum|maximum|increment-allocation-mb 
should have scheduler
 Key: YARN-1004
 URL: https://issues.apache.org/jira/browse/YARN-1004
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.1.0-beta
Reporter: Sandy Ryza


As yarn.scheduler.minimum-allocation-mb is now a scheduler-specific 
configuration, and functions differently for the Fair and Capacity schedulers, 
it would be less confusing for the config names to include the scheduler names, 
i.e. yarn.scheduler.fair.minimum-allocation-mb, 
yarn.scheduler.capacity.minimum-allocation-mb, and 
yarn.scheduler.fifo.minimum-allocation-mb.

The same goes for yarn.scheduler.increment-allocation-mb, which only exists for 
the Fair Scheduler, and yarn.scheduler.maximum-allocation-mb, for consistency.

If we wish to preserve backwards compatibility, we can deprecate the old 
configs to the new ones. 
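
If backwards compatibility is kept, the deprecation could look roughly like 
the sketch below, using Hadoop's Configuration deprecation mechanism. The new 
per-scheduler keys are the ones proposed above (shown here for the Fair 
Scheduler); exactly where this would be wired in is an assumption.

{code}
import org.apache.hadoop.conf.Configuration;

final class SchedulerConfigDeprecations {
  static void addDeprecations() {
    // Old global keys resolve to the scheduler-specific keys for the
    // scheduler actually in use; the Fair Scheduler case is shown.
    Configuration.addDeprecation("yarn.scheduler.minimum-allocation-mb",
        new String[] { "yarn.scheduler.fair.minimum-allocation-mb" });
    Configuration.addDeprecation("yarn.scheduler.increment-allocation-mb",
        new String[] { "yarn.scheduler.fair.increment-allocation-mb" });
    Configuration.addDeprecation("yarn.scheduler.maximum-allocation-mb",
        new String[] { "yarn.scheduler.fair.maximum-allocation-mb" });
  }
}
{code}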

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

