[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931483#comment-13931483
 ] 

Mayank Bansal commented on YARN-1389:
-

Thanks [~zjshen] for the review.

bq. 1. Please remove this comment
Done

bq. 2. We'd better not, otherwise, when AHS is disabled, if the object is not 
found in RM, we will get this annoying exception. Please follow your previous 
code pattern in YarnClientImpl#getApplicationReport
Done

bq. 3. Shouldn't you throw ApplicationNotFoundException, as you did in 
getContainerReport()?
We pass the attempt id for containers, so it's better to have an 
AttemptNotFoundException.
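
For readers following the thread, here is a minimal, hypothetical sketch of the 
fallback pattern being referenced: ask the RM first, and consult the history 
service only when the RM does not know the attempt and AHS is enabled. The types 
below are stand-ins for illustration, not the real YarnClientImpl classes.

{code}
import java.io.IOException;

// Stand-in for the attempt-level "not found" exception discussed above.
class AttemptNotFoundException extends IOException {
  AttemptNotFoundException(String msg) { super(msg); }
}

// Stand-in for the RM-side and AHS-side report sources.
interface AttemptReportSource {
  String getApplicationAttemptReport(String attemptId) throws IOException;
}

class FallbackClientSketch {
  private final AttemptReportSource rmClient;      // ApplicationClientProtocol side
  private final AttemptReportSource historyClient; // ApplicationHistoryProtocol side
  private final boolean historyEnabled;            // AHS may be disabled

  FallbackClientSketch(AttemptReportSource rm, AttemptReportSource history,
                       boolean historyEnabled) {
    this.rmClient = rm;
    this.historyClient = history;
    this.historyEnabled = historyEnabled;
  }

  String getAttemptReport(String attemptId) throws IOException {
    try {
      // Running attempts are served by the RM.
      return rmClient.getApplicationAttemptReport(attemptId);
    } catch (AttemptNotFoundException e) {
      // Only fall back when the history service is on; otherwise rethrow so the
      // caller sees a single, meaningful "not found" error instead of a second
      // connection failure.
      if (!historyEnabled) {
        throw e;
      }
      return historyClient.getApplicationAttemptReport(attemptId);
    }
  }
}
{code}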

bq. 4. Please remove this code. attempts will never be null. It can be empty, but 
that's reasonable. If the application hasn't even had its first attempt, the list 
is empty.
Done (changed the check to an empty check)

bq. 5. Similarly, rmContainers can't be null. After YARN-1794, we should have 
some work-around to get the finished containers from RM.
I think we need to keep this check until that is fixed; I may remove it as part 
of this JIRA.

bq. 6. Just return null. Let the UI decide what is going to be printed if the 
diagnostics are not available.
Done

Thanks,
Mayank

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct queries for running instances to 
> ApplicationClientProtocol and those for finished instances to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-1389:


Attachment: YARN-1389-8.patch

Updating the latest patch.

The test failure is not due to this patch.

Thanks,
Mayank

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch, 
> YARN-1389-8.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct queries for running instances to 
> ApplicationClientProtocol and those for finished instances to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931602#comment-13931602
 ] 

Hadoop QA commented on YARN-1389:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634119/YARN-1389-8.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3331//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3331//console

This message is automatically generated.

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch, 
> YARN-1389-8.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct queries for running instances to 
> ApplicationClientProtocol and those for finished instances to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1821) NPE on registerNodeManager if the request has containers for UnmanagedAMs

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931636#comment-13931636
 ] 

Hudson commented on YARN-1821:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #507 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/507/])
YARN-1821. NPE on registerNodeManager if the request has containers for 
UnmanagedAMs (kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576525)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java


> NPE on registerNodeManager if the request has containers for UnmanagedAMs
> -
>
> Key: YARN-1821
> URL: https://issues.apache.org/jira/browse/YARN-1821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.4.0
>
> Attachments: yarn-1821-1.patch
>
>
> On RM restart (or failover), NM re-registers with the RM. If it was running 
> containers for Unmanaged AMs, it runs into the following NPE:
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:213)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.registerNodeManager(ResourceTrackerPBServiceImpl.java:54)
> {noformat}
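
To illustrate the failure mode for readers (a hedged sketch with stand-in types, 
not the committed fix): registration handling has to tolerate container statuses 
whose application attempt has no RM-launched AM container, which is exactly the 
unmanaged-AM case described above.

{code}
import java.util.List;
import java.util.Map;

// Stand-in record; for unmanaged AMs the master container is null.
class AppRecord {
  String masterContainerId;
}

class RegistrationSketch {
  static void processRecoveredContainers(List<String> containerAppIds,
                                          Map<String, AppRecord> runningApps) {
    for (String appId : containerAppIds) {
      AppRecord app = runningApps.get(appId);
      if (app == null || app.masterContainerId == null) {
        // Unmanaged AMs have no RM-launched master container, so dereferencing
        // this field blindly is the kind of thing that yields the NPE above.
        continue;
      }
      // ... normal bookkeeping against app.masterContainerId ...
    }
  }
}
{code}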



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1800) YARN NodeManager with java.util.concurrent.RejectedExecutionException

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931637#comment-13931637
 ] 

Hudson commented on YARN-1800:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #507 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/507/])
YARN-1800. Fixed NodeManager to gracefully handle RejectedExecutionException in 
the public-localizer thread-pool. Contributed by Varun Vasudev. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576545)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java


> YARN NodeManager with java.util.concurrent.RejectedExecutionException
> -
>
> Key: YARN-1800
> URL: https://issues.apache.org/jira/browse/YARN-1800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Paul Isaychuk
>Assignee: Varun Vasudev
>Priority: Critical
> Fix For: 2.4.0
>
> Attachments: apache-yarn-1800.0.patch, apache-yarn-1800.1.patch, 
> yarn-yarn-nodemanager-host-2.log.zip
>
>
> Noticed this on tests running on Apache Hadoop 2.2 cluster
> {code}
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar
>  transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.splitmetainfo
>  transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.split 
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.xml 
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,576 INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(651)) - Downloading public 
> rsrc:{ 
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
>  1390440627435, FILE, null }
> 2014-01-23 01:30:28,576 FATAL event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(141)) - Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at 
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:678)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:583)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:525)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
> at java.lang.Thread.run(Thread.java:662)
> 2014-01-23 01:30:28,577 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(144)) - Exiting, bbye..
> 2014-01-23 01:30:28,596 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped 
> SelectChannelConnector@0.0.0.0:50060
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(328)) - 
> Applications still running : [application_1389742077466_0396]
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(336)) - Wa
> {code}
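
The commit above describes handling RejectedExecutionException gracefully in the 
public-localizer thread pool. A minimal, self-contained sketch of that general 
pattern (not the actual ResourceLocalizationService code) using 
java.util.concurrent:

{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

class PublicLocalizerSketch {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final ExecutorCompletionService<String> queue =
      new ExecutorCompletionService<String>(pool);

  void addResource(final String resource) {
    try {
      queue.submit(new Callable<String>() {
        @Override
        public String call() {
          return download(resource);
        }
      });
    } catch (RejectedExecutionException e) {
      // The pool rejects new work while shutting down (e.g. during NM stop).
      // Log and skip instead of letting the dispatcher thread die.
      System.err.println("Skipping " + resource + ": localizer is shutting down");
    }
  }

  private String download(String resource) {
    return resource; // placeholder for the actual fetch
  }
}
{code}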



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (YARN-1824) Make Windows client work with Linux/Unix cluster

2014-03-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931653#comment-13931653
 ] 

Steve Loughran commented on YARN-1824:
--

# YARN-1565 relates to this a bit, as it proposes the RM actually serving up 
some of this information (e.g. lib path). 
# The env var expansion strategy is going to complicate the 
{{yarn.application.classpath}} value:

{code}
  <property>
    <name>yarn.application.classpath</name>
    <value>
      /etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*
    </value>
  </property>
{code}

This will change to something like:

{code}
  <property>
    <name>yarn.application.classpath</name>
    <value>
      {{/etc/hadoop/conf,/usr/lib/hadoop/*}}{{/usr/lib/hadoop/lib/*}}{{/usr/lib/hadoop-hdfs/*}}
      {{/usr/lib/hadoop-hdfs/lib/*}}{{,/usr/lib/hadoop-yarn/*}}{{/usr/lib/hadoop-yarn/lib/*}}
      {{,/usr/lib/hadoop-mapreduce/*}}{{/usr/lib/hadoop-mapreduce/lib/*}}
    </value>
  </property>
{code}

If this is the case, whitespace between braces should be stripped so it's 
easier to lay out paths.

I'd also like to see, in the absence of YARN-1565, the ability to inject the 
value of {{yarn.application.classpath}} that the RM knows about, because today 
the clients need to know it, and if you get it wrong you get meaningless errors 
that are very hard to track down. 

FWIW, in Ant we just let apps use : or , as separators and sort it out, with 
some special handling for the {{C:/lib/java}} syntax.
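
For context, a hedged sketch of what a client ends up doing today to build the 
container classpath from {{yarn.application.classpath}} (assuming the standard 
YarnConfiguration keys); joining with the client's own File.pathSeparator is 
exactly what breaks when a Windows client talks to a Linux cluster, which is the 
problem this issue is about. The trimming of stray whitespace is the behaviour 
suggested above, not something the current code is guaranteed to do.

{code}
import java.io.File;

import org.apache.hadoop.yarn.conf.YarnConfiguration;

class ClasspathSketch {
  // Builds the classpath string a client ships with its container launch context.
  static String buildClasspath(YarnConfiguration conf) {
    String[] entries = conf.getStrings(
        YarnConfiguration.YARN_APPLICATION_CLASSPATH,
        YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH);
    StringBuilder cp = new StringBuilder();
    for (String entry : entries) {
      if (cp.length() > 0) {
        cp.append(File.pathSeparator); // client-side separator: ':' vs ';'
      }
      cp.append(entry.trim()); // strip whitespace so multi-line values lay out cleanly
    }
    return cp.toString();
  }
}
{code}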



> Make Windows client work with Linux/Unix cluster
> 
>
> Key: YARN-1824
> URL: https://issues.apache.org/jira/browse/YARN-1824
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1824.1.patch, YARN-1824.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1800) YARN NodeManager with java.util.concurrent.RejectedExecutionException

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931775#comment-13931775
 ] 

Hudson commented on YARN-1800:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1699 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1699/])
YARN-1800. Fixed NodeManager to gracefully handle RejectedExecutionException in 
the public-localizer thread-pool. Contributed by Varun Vasudev. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576545)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java


> YARN NodeManager with java.util.concurrent.RejectedExecutionException
> -
>
> Key: YARN-1800
> URL: https://issues.apache.org/jira/browse/YARN-1800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Paul Isaychuk
>Assignee: Varun Vasudev
>Priority: Critical
> Fix For: 2.4.0
>
> Attachments: apache-yarn-1800.0.patch, apache-yarn-1800.1.patch, 
> yarn-yarn-nodemanager-host-2.log.zip
>
>
> Noticed this on tests running on Apache Hadoop 2.2 cluster
> {code}
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar
>  transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.splitmetainfo
>  transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.split 
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.xml 
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,576 INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(651)) - Downloading public 
> rsrc:{ 
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
>  1390440627435, FILE, null }
> 2014-01-23 01:30:28,576 FATAL event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(141)) - Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at 
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:678)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:583)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:525)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
> at java.lang.Thread.run(Thread.java:662)
> 2014-01-23 01:30:28,577 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(144)) - Exiting, bbye..
> 2014-01-23 01:30:28,596 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped 
> SelectChannelConnector@0.0.0.0:50060
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(328)) - 
> Applications still running : [application_1389742077466_0396]
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(336)) - Wa
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (YARN-1821) NPE on registerNodeManager if the request has containers for UnmanagedAMs

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931774#comment-13931774
 ] 

Hudson commented on YARN-1821:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1699 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1699/])
YARN-1821. NPE on registerNodeManager if the request has containers for 
UnmanagedAMs (kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576525)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java


> NPE on registerNodeManager if the request has containers for UnmanagedAMs
> -
>
> Key: YARN-1821
> URL: https://issues.apache.org/jira/browse/YARN-1821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.4.0
>
> Attachments: yarn-1821-1.patch
>
>
> On RM restart (or failover), NM re-registers with the RM. If it was running 
> containers for Unmanaged AMs, it runs into the following NPE:
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:213)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.registerNodeManager(ResourceTrackerPBServiceImpl.java:54)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1825) Need ability to create ACL for viewing the listing of a job queue

2014-03-12 Thread Alex Nastetsky (JIRA)
Alex Nastetsky created YARN-1825:


 Summary: Need ability to create ACL for viewing the listing of a 
job queue
 Key: YARN-1825
 URL: https://issues.apache.org/jira/browse/YARN-1825
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.1-beta
 Environment: HDP-2.0.6
Reporter: Alex Nastetsky


We need a way to restrict the ability to see the list of jobs in a queue.
Currently, the only queue ACLs available are acl_administer_queues, 
acl_administer_queue and acl_submit_applications, none of which control the 
ability to see the list of jobs in a queue.
This requirement is necessary because the Job History server provides a lot of 
potentially sensitive info in the job listing alone, e.g. a Hive map reduce job 
shows the query itself in the Name column.
The only thing we have currently available is the ability to filter by queue 
name, but there is no way to enforce a filter.

NOTE: This is a duplicate of MAPREDUCE-5750, not sure if this should go into MR 
or YARN.
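
As a small illustration of the gap (a sketch assuming the standard YarnClient 
API, not a proposed fix): listing the ACLs the current user holds per queue shows 
that the enum only covers submitting and administering, with nothing governing 
visibility of the job listing.

{code}
import java.util.List;

import org.apache.hadoop.yarn.api.records.QueueACL;
import org.apache.hadoop.yarn.api.records.QueueUserACLInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class QueueAclProbe {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      for (QueueUserACLInfo info : client.getQueueAclsInfo()) {
        // Only SUBMIT_APPLICATIONS and ADMINISTER_QUEUE exist today; there is
        // no ACL that restricts who may *list* the jobs in a queue.
        List<QueueACL> acls = info.getUserAcls();
        System.out.println(info.getQueueName() + " -> " + acls);
      }
    } finally {
      client.stop();
    }
  }
}
{code}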



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1444) RM crashes when node resource request sent without corresponding off-switch request

2014-03-12 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931808#comment-13931808
 ] 

Arun C Murthy commented on YARN-1444:
-

Looks like YARN-1591 is tracking the TestResourceTrackerService failure. /cc [~vinodkv]

I'll go ahead and commit this.

> RM crashes when node resource request sent without corresponding off-switch 
> request
> ---
>
> Key: YARN-1444
> URL: https://issues.apache.org/jira/browse/YARN-1444
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client, resourcemanager
>Reporter: Robert Grandl
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: yarn-1444.ver1.patch, yarn-1444.ver2.patch
>
>
> I have tried to force reducers to execute on certain nodes. What I did is, for 
> reduce tasks, change 
> RMContainerRequestor#addResourceRequest(req.priority, ResourceRequest.ANY, 
> req.capability) to RMContainerRequestor#addResourceRequest(req.priority, 
> HOST_NAME, req.capability). 
> However, this change leads to RM crashes when reducers need to be assigned, 
> with the following exception:
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:841)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:640)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:554)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:695)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:739)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:549)
> at java.lang.Thread.run(Thread.java:722)
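
As an aside for application writers hitting this: the safer way to pin a request 
to a host from application code is the public AMRMClient API, which fills in the 
rack and off-switch (ANY) requests the scheduler expects, rather than editing 
RMContainerRequestor internals. A hedged sketch (resource size and priority are 
arbitrary):

{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class LocalityRequestSketch {
  // Ask for a container on a specific host while letting the client library
  // expand the request into the node, rack and ANY entries.
  static void requestOnHost(AMRMClient<ContainerRequest> amClient, String host) {
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(10);
    amClient.addContainerRequest(
        new ContainerRequest(capability, new String[] {host}, null, priority));
  }
}
{code}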



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1444) RM crashes when node resource request sent without corresponding off-switch request

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931827#comment-13931827
 ] 

Hudson commented on YARN-1444:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5310 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5310/])
YARN-1444. Fix CapacityScheduler to deal with cases where applications specify 
host/rack requests without off-switch request. Contributed by Wangda Tan. 
(acmurthy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576751)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java


> RM crashes when node resource request sent without corresponding off-switch 
> request
> ---
>
> Key: YARN-1444
> URL: https://issues.apache.org/jira/browse/YARN-1444
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client, resourcemanager
>Reporter: Robert Grandl
>Assignee: Wangda Tan
>Priority: Blocker
> Fix For: 2.4.0
>
> Attachments: yarn-1444.ver1.patch, yarn-1444.ver2.patch
>
>
> I have tried to force reducers to execute on certain nodes. What I did is, for 
> reduce tasks, change 
> RMContainerRequestor#addResourceRequest(req.priority, ResourceRequest.ANY, 
> req.capability) to RMContainerRequestor#addResourceRequest(req.priority, 
> HOST_NAME, req.capability). 
> However, this change leads to RM crashes when reducers need to be assigned, 
> with the following exception:
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:841)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:640)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:554)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:695)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:739)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:549)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1821) NPE on registerNodeManager if the request has containers for UnmanagedAMs

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931838#comment-13931838
 ] 

Hudson commented on YARN-1821:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1724 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1724/])
YARN-1821. NPE on registerNodeManager if the request has containers for 
UnmanagedAMs (kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576525)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java


> NPE on registerNodeManager if the request has containers for UnmanagedAMs
> -
>
> Key: YARN-1821
> URL: https://issues.apache.org/jira/browse/YARN-1821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.4.0
>
> Attachments: yarn-1821-1.patch
>
>
> On RM restart (or failover), NM re-registers with the RM. If it was running 
> containers for Unmanaged AMs, it runs into the following NPE:
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:213)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.registerNodeManager(ResourceTrackerPBServiceImpl.java:54)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1800) YARN NodeManager with java.util.concurrent.RejectedExecutionException

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931839#comment-13931839
 ] 

Hudson commented on YARN-1800:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1724 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1724/])
YARN-1800. Fixed NodeManager to gracefully handle RejectedExecutionException in 
the public-localizer thread-pool. Contributed by Varun Vasudev. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576545)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java


> YARN NodeManager with java.util.concurrent.RejectedExecutionException
> -
>
> Key: YARN-1800
> URL: https://issues.apache.org/jira/browse/YARN-1800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Paul Isaychuk
>Assignee: Varun Vasudev
>Priority: Critical
> Fix For: 2.4.0
>
> Attachments: apache-yarn-1800.0.patch, apache-yarn-1800.1.patch, 
> yarn-yarn-nodemanager-host-2.log.zip
>
>
> Noticed this on tests running on Apache Hadoop 2.2 cluster
> {code}
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar
>  transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.splitmetainfo
>  transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.split 
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,575 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(196)) - Resource 
> hdfs://colo-2:8020/user/fertrist/.staging/job_1389742077466_0396/job.xml 
> transitioned from INIT to DOWNLOADING
> 2014-01-23 01:30:28,576 INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(651)) - Downloading public 
> rsrc:{ 
> hdfs://colo-2:8020/user/fertrist/oozie-oozi/605-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
>  1390440627435, FILE, null }
> 2014-01-23 01:30:28,576 FATAL event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(141)) - Error in dispatcher thread
> java.util.concurrent.RejectedExecutionException
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at 
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:678)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:583)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:525)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
> at java.lang.Thread.run(Thread.java:662)
> 2014-01-23 01:30:28,577 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(144)) - Exiting, bbye..
> 2014-01-23 01:30:28,596 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped 
> SelectChannelConnector@0.0.0.0:50060
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(328)) - 
> Applications still running : [application_1389742077466_0396]
> 2014-01-23 01:30:28,597 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(336)) - Wa
> {code}

[jira] [Commented] (YARN-1444) RM crashes when node resource request sent without corresponding off-switch request

2014-03-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931900#comment-13931900
 ] 

Wangda Tan commented on YARN-1444:
--

Thanks [~acmurthy] for reviewing this.

> RM crashes when node resource request sent without corresponding off-switch 
> request
> ---
>
> Key: YARN-1444
> URL: https://issues.apache.org/jira/browse/YARN-1444
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client, resourcemanager
>Reporter: Robert Grandl
>Assignee: Wangda Tan
>Priority: Blocker
> Fix For: 2.4.0
>
> Attachments: yarn-1444.ver1.patch, yarn-1444.ver2.patch
>
>
> I have tried to force reducers to execute on certain nodes. What I did is, for 
> reduce tasks, change 
> RMContainerRequestor#addResourceRequest(req.priority, ResourceRequest.ANY, 
> req.capability) to RMContainerRequestor#addResourceRequest(req.priority, 
> HOST_NAME, req.capability). 
> However, this change leads to RM crashes when reducers need to be assigned, 
> with the following exception:
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:841)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:640)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:554)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:695)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:739)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:549)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1658) Webservice should redirect to active RM when HA is enabled.

2014-03-12 Thread Cindy Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931923#comment-13931923
 ] 

Cindy Li commented on YARN-1658:


The test failure is unrelated to this patch. 

> Webservice should redirect to active RM when HA is enabled.
> ---
>
> Key: YARN-1658
> URL: https://issues.apache.org/jira/browse/YARN-1658
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Cindy Li
>Assignee: Cindy Li
>  Labels: YARN
> Attachments: YARN1658.1.patch, YARN1658.patch
>
>
> When HA is enabled, web service requests to the standby RM should be redirected 
> to the active RM. This is related to YARN-1525.
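
A rough sketch of the behaviour being asked for (a hedged illustration using the 
plain servlet API; the isActive() and activeRMWebAddress() helpers are 
hypothetical placeholders, not existing RM code): a filter on the standby RM's 
web server that bounces requests to the active RM while preserving path and 
query string.

{code}
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

class StandbyRedirectFilter implements Filter {
  @Override public void init(FilterConfig cfg) {}
  @Override public void destroy() {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest httpReq = (HttpServletRequest) req;
    HttpServletResponse httpResp = (HttpServletResponse) resp;
    if (isActive()) {
      chain.doFilter(req, resp); // the active RM serves the call itself
    } else {
      // Preserve the original path and query when bouncing to the active RM.
      String target = activeRMWebAddress() + httpReq.getRequestURI()
          + (httpReq.getQueryString() == null ? "" : "?" + httpReq.getQueryString());
      httpResp.sendRedirect(target);
    }
  }

  private boolean isActive() { return false; }                       // placeholder
  private String activeRMWebAddress() { return "http://rm2:8088"; }  // placeholder
}
{code}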



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1826) TestDirectoryCollection intermittent failures

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932007#comment-13932007
 ] 

Tsuyoshi OZAWA commented on YARN-1826:
--

Log is as follows:
{code}
testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection)
  Time elapsed: 0.267 sec  <<< FAILURE!
java.lang.AssertionError: local dir parent not created with proper permissions 
expected: but was:
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at 
org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104)
{code}
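
One common source of this kind of flakiness is umask-dependent directory 
creation, so as a generic illustration (plain JDK, not the NodeManager code): 
create the directory first and then set its permissions explicitly, which does 
not depend on the process umask.

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

class DirPermissionSketch {
  // Create a directory and set its permissions explicitly afterwards, so the
  // result does not depend on the environment's umask.
  static Path createWithPerms(String dir, String perms) throws IOException {
    Set<PosixFilePermission> wanted = PosixFilePermissions.fromString(perms);
    Path p = Files.createDirectories(Paths.get(dir));
    Files.setPosixFilePermissions(p, wanted);
    return p;
  }
}
{code}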

> TestDirectoryCollection intermittent failures
> -
>
> Key: YARN-1826
> URL: https://issues.apache.org/jira/browse/YARN-1826
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tsuyoshi OZAWA
>
> testCreateDirectories fails intermittently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1826) TestDirectoryCollection intermittent failures

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created YARN-1826:


 Summary: TestDirectoryCollection intermittent failures
 Key: YARN-1826
 URL: https://issues.apache.org/jira/browse/YARN-1826
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA


testCreateDirectories fails intermittently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1506) Replace set resource change on RMNode/SchedulerNode directly with event notification.

2014-03-12 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated YARN-1506:


Priority: Major  (was: Blocker)

> Replace set resource change on RMNode/SchedulerNode directly with event 
> notification.
> -
>
> Key: YARN-1506
> URL: https://issues.apache.org/jira/browse/YARN-1506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, scheduler
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-1506-v1.patch, YARN-1506-v2.patch, 
> YARN-1506-v3.patch, YARN-1506-v4.patch, YARN-1506-v5.patch, 
> YARN-1506-v6.patch, YARN-1506-v7.patch
>
>
> According to Vinod's comments on YARN-312 
> (https://issues.apache.org/jira/browse/YARN-312?focusedCommentId=13846087&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13846087),
>  we should replace RMNode.setResourceOption() with some resource change event.
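
A toy sketch of the direction suggested (illustrative types only, not the real 
RMNode/scheduler classes): instead of calling a setter on the node object from 
the outside, post an event and let the node react on the dispatcher thread.

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ResourceChangedEvent {
  final String nodeId;
  final int newMemoryMb;

  ResourceChangedEvent(String nodeId, int newMemoryMb) {
    this.nodeId = nodeId;
    this.newMemoryMb = newMemoryMb;
  }
}

class NodeEventLoop implements Runnable {
  private final BlockingQueue<ResourceChangedEvent> queue =
      new LinkedBlockingQueue<ResourceChangedEvent>();

  // Callers post events instead of mutating node state directly.
  void post(ResourceChangedEvent e) {
    queue.add(e);
  }

  @Override
  public void run() {
    try {
      while (true) {
        ResourceChangedEvent e = queue.take();
        // The node updates its own view and notifies the scheduler here,
        // keeping all mutation on the dispatcher thread.
        System.out.println(e.nodeId + " -> " + e.newMemoryMb + " MB");
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}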



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932013#comment-13932013
 ] 

Tsuyoshi OZAWA commented on YARN-1798:
--

Additional log about TestNodeManagerShutdown.testKillContainersOnShutdown 
failure:

{code}
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.783 sec <<< 
FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown
testKillContainersOnShutdown(org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown)
  Time elapsed: 6.456 sec  <<< FAILURE!
junit.framework.AssertionFailedError: Did not find sigterm message
at junit.framework.Assert.fail(Assert.java:50)
at junit.framework.Assert.assertTrue(Assert.java:20)
at 
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.testKillContainersOnShutdown(TestNodeManagerShutdown.java:153)
{code}

> TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, 
> TestNodeStatusUpdater fails on Linux
> -
>
> Key: YARN-1798
> URL: https://issues.apache.org/jira/browse/YARN-1798
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tsuyoshi OZAWA
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1812) Job stays in PREP state for long time after RM Restarts

2014-03-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1812:
--

Attachment: YARN-1812.3.patch

Same patch as before with the typo fixed.

> Job stays in PREP state for long time after RM Restarts
> --
>
> Key: YARN-1812
> URL: https://issues.apache.org/jira/browse/YARN-1812
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Jian He
> Attachments: YARN-1812.1.patch, YARN-1812.2.patch, YARN-1812.3.patch
>
>
> Steps followed:
> 1) start a sort job with 80 maps and 5 reducers
> 2) restart Resource manager when 60 maps and 0 reducers are finished
> 3) Wait for job to come out of PREP state.
> The job does not come out of PREP state after 7-8 mins.
> After waiting for 7-8 mins, the test kills the job.
> However, the sort job should not take this long to come out of PREP state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1474) Make schedulers services

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1474:
-

Attachment: YARN-1474.6.patch

Rebased on trunk. The basic approach is the same as in YARN-1474.5.patch.

> Make schedulers services
> 
>
> Key: YARN-1474
> URL: https://issues.apache.org/jira/browse/YARN-1474
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.3.0
>Reporter: Sandy Ryza
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
> YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch
>
>
> Schedulers currently have a reinitialize method but no start and stop. Fitting 
> them into the YARN service model would make things more coherent.
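
A hedged sketch of what fitting a scheduler into the service model could look 
like (SketchScheduler is illustrative only; the real schedulers would keep their 
existing interfaces): the setup that today lives in reinitialize moves into the 
standard lifecycle hooks of AbstractService.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

class SketchScheduler extends AbstractService {
  SketchScheduler() {
    super("SketchScheduler");
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // read queue configuration, build the initial queue hierarchy, etc.
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // start background update threads, register metrics
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // stop threads and release resources
    super.serviceStop();
  }
}
{code}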



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1812) Job stays in PREP state for long time after RM Restarts

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932087#comment-13932087
 ] 

Hadoop QA commented on YARN-1812:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634206/YARN-1812.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3332//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3332//console

This message is automatically generated.

> Job stays in PREP state for long time after RM Restarts
> --
>
> Key: YARN-1812
> URL: https://issues.apache.org/jira/browse/YARN-1812
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Jian He
> Attachments: YARN-1812.1.patch, YARN-1812.2.patch, YARN-1812.3.patch
>
>
> Steps followed:
> 1) start a sort job with 80 maps and 5 reducers
> 2) restart Resource manager when 60 maps and 0 reducers are finished
> 3) Wait for job to come out of PREP state.
> The job does not come out of PREP state after 7-8 mins.
> After waiting for 7-8 mins, the test kills the job.
> However, the sort job should not take this long to come out of PREP state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1812) Job stays in PREP state for long time after RM Restarts

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932093#comment-13932093
 ] 

Tsuyoshi OZAWA commented on YARN-1812:
--

TestResourceTrackerService's failure is filed as YARN-1591.

> Job stays in PREP state for long time after RM Restarts
> --
>
> Key: YARN-1812
> URL: https://issues.apache.org/jira/browse/YARN-1812
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Jian He
> Attachments: YARN-1812.1.patch, YARN-1812.2.patch, YARN-1812.3.patch
>
>
> Steps followed:
> 1) start a sort job with 80 maps and 5 reducers
> 2) restart Resource manager when 60 maps and 0 reducers are finished
> 3) Wait for job to come out of PREP state.
> The job does not come out of PREP state after 7-8 mins.
> After waiting for 7-8 mins, the test kills the job.
> However, the sort job should not take this long to come out of PREP state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932095#comment-13932095
 ] 

Tsuyoshi OZAWA commented on YARN-1591:
--

[~jianhe], can you take a look at the latest patch?

> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932097#comment-13932097
 ] 

Jian He commented on YARN-1591:
---

Patch looks good, committing it.

> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1816) Succeeded application remains in accepted

2014-03-12 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1816:
--

Attachment: YARN-1816.1.patch

> Succeeded application remains in accepted
> -
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Attachments: YARN-1816.1.patch
>
>
> {code}
> 2014-03-10 18:07:31,944|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:07:31,945|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:08:02,125|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:08:03,198|beaver.machine|INFO|14/03/10 18:08:03 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:08:03,238|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:08:03,239|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:08:03,239|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:08:33,390|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:08:34,437|beaver.machine|INFO|14/03/10 18:08:34 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:08:34,477|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:08:34,477|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:08:34,478|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:09:04,628|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:09:05,688|beaver.machine|INFO|14/03/10 18:09:05 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:09:05,728|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:09:05,728|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:09:05,729|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:09:35,879|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:09:36,951|beaver.machine|INFO|14/03/10 18:09:36 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:09:36,992|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:09:36,993|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Track

[jira] [Commented] (YARN-1816) Succeeded application remains in accepted

2014-03-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932105#comment-13932105
 ] 

Jian He commented on YARN-1816:
---

The log shows that on recovery the attempt is finished, but the application is 
stuck in the ACCEPTED state. The reason is that RMApp fails to handle the 
AttemptFinished event when the attempt recovers and sends the AttemptFinished 
event back to the RMApp.
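
A toy illustration of the failure described above (stand-in enums and map, not 
the real RMAppImpl state machine): when ACCEPTED has no transition for 
ATTEMPT_FINISHED, the replayed event is dropped and the application never 
reaches a finished state.

{code}
import java.util.EnumMap;
import java.util.Map;

enum AppState { ACCEPTED, RUNNING, FINISHED }
enum AppEvent { ATTEMPT_REGISTERED, ATTEMPT_FINISHED }

class AppStateMachineSketch {
  private AppState state = AppState.ACCEPTED;
  private final Map<AppState, Map<AppEvent, AppState>> transitions =
      new EnumMap<AppState, Map<AppEvent, AppState>>(AppState.class);

  AppStateMachineSketch() {
    transitions.put(AppState.ACCEPTED, new EnumMap<AppEvent, AppState>(AppEvent.class));
    transitions.put(AppState.RUNNING, new EnumMap<AppEvent, AppState>(AppEvent.class));
    transitions.get(AppState.ACCEPTED).put(AppEvent.ATTEMPT_REGISTERED, AppState.RUNNING);
    transitions.get(AppState.RUNNING).put(AppEvent.ATTEMPT_FINISHED, AppState.FINISHED);
    // Recovery can replay ATTEMPT_FINISHED while still ACCEPTED; without the
    // line below the event is unhandled and the app stays ACCEPTED forever.
    transitions.get(AppState.ACCEPTED).put(AppEvent.ATTEMPT_FINISHED, AppState.FINISHED);
  }

  void handle(AppEvent event) {
    AppState next = transitions.get(state).get(event);
    if (next != null) {
      state = next; // unknown transitions are ignored, mirroring the reported bug
    }
  }

  AppState getState() {
    return state;
  }
}
{code}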

> Succeeded application remains in accepted
> -
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Attachments: YARN-1816.1.patch
>
>
> {code}
> 2014-03-10 18:07:31,944|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:07:31,945|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:08:02,125|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:08:03,198|beaver.machine|INFO|14/03/10 18:08:03 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:08:03,238|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:08:03,239|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:08:03,239|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:08:33,390|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:08:34,437|beaver.machine|INFO|14/03/10 18:08:34 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:08:34,477|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:08:34,477|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:08:34,478|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:09:04,628|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:09:05,688|beaver.machine|INFO|14/03/10 18:09:05 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:09:05,728|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:09:05,728|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:09:05,729|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:09:35,879|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:09:36,951|beaver.machine|INFO|14/03/10 18:09:36 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:09:36,992|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
>

[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932112#comment-13932112
 ] 

Tsuyoshi OZAWA commented on YARN-1591:
--

Thanks, Jian!

> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1370) Fair scheduler to re-populate container allocation state

2014-03-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1370:
---

Assignee: Anubhav Dhoot  (was: Karthik Kambatla)

> Fair scheduler to re-populate container allocation state
> 
>
> Key: YARN-1370
> URL: https://issues.apache.org/jira/browse/YARN-1370
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
>
> YARN-1367 and YARN-1368 enable the NM to tell the RM about currently running 
> containers and the RM will pass this information to the schedulers along with 
> the node information. The schedulers are currently already informed about 
> previously running apps when the app data is recovered from the store. The 
> scheduler is expected to be able to repopulate its allocation state from the 
> above 2 sources of information.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1816) Succeeded application remains in accepted

2014-03-12 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1816:
--

Fix Version/s: 2.4.0

> Succeeded application remains in accepted
> -
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1816.1.patch
>
>

[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932150#comment-13932150
 ] 

Jian He commented on YARN-1591:
---

I just ran the tests locally myself and saw some redundant shutdown exceptions, 
probably because DefaultMetricsSystem.shutdown() is immediately followed by 
rm.stop() in the tearDown method. Can you take a deeper look? Thanks!
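
Purely as an illustration of one possible ordering (the real test may need a different fix), the tearDown could stop the RM first and shut the metrics system down exactly once afterwards:
{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
import org.junit.After;
import org.junit.Test;

public class TearDownOrderingSketch {
  private MockRM rm;  // the MockRM instance a test like this would keep around

  @After
  public void tearDown() {
    if (rm != null) {
      rm.stop();                      // let the RM's own stop path run first
    }
    DefaultMetricsSystem.shutdown();  // then shut the metrics system down once
  }

  @Test
  public void noop() {
    // Placeholder so the lifecycle above actually executes under JUnit.
  }
}
{code}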

> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932160#comment-13932160
 ] 

Tsuyoshi OZAWA commented on YARN-1591:
--

Oh, I'm checking it. Thank you for the pointing.

> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1370) Fair scheduler to re-populate container allocation state

2014-03-12 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932175#comment-13932175
 ] 

Anubhav Dhoot commented on YARN-1370:
-

Ack. Will dig deeper into this.





> Fair scheduler to re-populate container allocation state
> 
>
> Key: YARN-1370
> URL: https://issues.apache.org/jira/browse/YARN-1370
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
>
> YARN-1367 and YARN-1368 enable the NM to tell the RM about currently running 
> containers and the RM will pass this information to the schedulers along with 
> the node information. The schedulers are currently already informed about 
> previously running apps when the app data is recovered from the store. The 
> scheduler is expected to be able to repopulate its allocation state from the 
> above 2 sources of information.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1824) Make Windows client work with Linux/Unix cluster

2014-03-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932189#comment-13932189
 ] 

Jian He commented on YARN-1824:
---

If I understand you correctly, the following is read from the xml file and will 
first be parsed and put into an in-memory map:
{code}
<property>
  <name>yarn.application.classpath</name>
  <value>
    /etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*
  </value>
</property>
{code}
After that, the constant <CPS> will be used to concatenate the paths as follows, 
and the result is sent across to the NM. The NM will replace the <CPS> with its 
own separator.
{code}
/etc/hadoop/conf<CPS>/usr/lib/hadoop/*<CPS>/usr/lib/hadoop/lib/*<CPS>/usr/lib/hadoop-hdfs/*<CPS>/usr/lib/hadoop-hdfs/lib/* ...
{code}
The curly brackets {{VAR}} are used as an env-expansion marker for indicating env 
variables.
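
A rough sketch of that flow, assuming a <CPS>-style marker; the constant and method names below are illustrative, not necessarily the exact ones in the patch:
{code}
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class CrossPlatformClasspathSketch {
  // Assumed platform-neutral marker; the real constant lives elsewhere.
  static final String CPS = "<CPS>";

  // Client side: join the comma-separated yarn.application.classpath entries
  // with the neutral marker instead of a platform-specific ':' or ';'.
  static String buildClientClasspath(String configuredValue) {
    List<String> entries = Arrays.asList(configuredValue.split(","));
    return String.join(CPS, entries);
  }

  // NM side: substitute the marker with the local separator at launch time.
  static String expandOnNodeManager(String classpath) {
    return classpath.replace(CPS, File.pathSeparator);
  }

  public static void main(String[] args) {
    String conf = "/etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*";
    String wire = buildClientClasspath(conf);
    System.out.println(wire);                      // entries joined with <CPS>
    System.out.println(expandOnNodeManager(wire)); // ':' on Unix, ';' on Windows
  }
}
{code}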

> Make Windows client work with Linux/Unix cluster
> 
>
> Key: YARN-1824
> URL: https://issues.apache.org/jira/browse/YARN-1824
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1824.1.patch, YARN-1824.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1816) Succeeded application remains in accepted

2014-03-12 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1816:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-128

> Succeeded application remains in accepted
> -
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1816.1.patch
>
>

[jira] [Updated] (YARN-1812) Job stays in PREP state for long time after RM Restarts

2014-03-12 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1812:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-128

> Job stays in PREP state for long time after RM Restarts
> --
>
> Key: YARN-1812
> URL: https://issues.apache.org/jira/browse/YARN-1812
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Yesha Vora
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1812.1.patch, YARN-1812.2.patch, YARN-1812.3.patch
>
>
> Steps followed:
> 1) start a sort job with 80 maps and 5 reducers
> 2) restart Resource manager when 60 maps and 0 reducers are finished
> 3) Wait for job to come out of PREP state.
> The job does not come out of PREP state after 7-8 mins.
> After waiting for 7-8 mins, test kills the job.
> However, Sort job should not take this long time to come out of PREP state



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource

2014-03-12 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932203#comment-13932203
 ] 

Chris Douglas commented on YARN-1771:
-

I just skimmed the patch, but it lgtm. The LoadingCache impl is very clean, and 
only caching over the course of a container localization relieves one of any 
practical responsibility to limit the cache size (that said, might as well add 
something fixed). Only minor, optional nits: if a path is invalid/inaccessible, 
it might make sense to memoize the failure as well. {{FSDownload::isPublic}} can 
be package-private (and annotated w/ {{\@VisibleForTesting}} for the unit test), 
rather than public.
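
For reference, a bounded Guava cache along those lines might look roughly like this (the key type and loader are simplified placeholders, not the patch's actual types):
{code}
import java.util.concurrent.ExecutionException;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StatCacheSketch {
  // Bounded cache of getFileStatus results, scoped to a single container
  // localization; the fixed maximumSize is the "might as well" safety net.
  private final LoadingCache<Path, FileStatus> statCache;

  public StatCacheSketch(final FileSystem fs) {
    this.statCache = CacheBuilder.newBuilder()
        .maximumSize(1000)
        .build(new CacheLoader<Path, FileStatus>() {
          @Override
          public FileStatus load(Path path) throws Exception {
            return fs.getFileStatus(path);
          }
        });
  }

  public FileStatus getStatus(Path path) throws ExecutionException {
    // Repeated ancestor checks for the same path hit the cache instead of
    // issuing another getfileinfo call to the NameNode.
    return statCache.get(path);
  }
}
{code}
Note that a load that throws is not memoized by Guava by default, which is exactly the "memoize the failure" case mentioned above.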

> many getFileStatus calls made from node manager for localizing a public 
> distributed cache resource
> --
>
> Key: YARN-1771
> URL: https://issues.apache.org/jira/browse/YARN-1771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>Priority: Critical
> Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch
>
>
> We're observing that the getFileStatus calls are putting a fair amount of 
> load on the name node as part of checking the public-ness for localizing a 
> resource that belong in the public cache.
> We see 7 getFileStatus calls made for each of these resource. We should look 
> into reducing the number of calls to the name node. One example:
> {noformat}
> 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   src=/tmp ...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   src=/...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,355 INFO audit: ... cmd=open  
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1812) Job stays in PREP state for long time after RM Restarts

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932225#comment-13932225
 ] 

Hudson commented on YARN-1812:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5311 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5311/])
YARN-1812. Fixed ResourceManager to synchronously renew tokens after recovery and 
thus recover app itself synchronously and avoid races with resyncing 
NodeManagers. Contributed by Jian He. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576843)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/RMHATestBase.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAppManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java


> Job stays in PREP state for long time after RM Restarts
> --
>
> Key: YARN-1812
> URL: https://issues.apache.org/jira/browse/YARN-1812
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Yesha Vora
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1812.1.patch, YARN-1812.2.patch, YARN-1812.3.patch
>
>
> Steps followed:
> 1) start a sort job with 80 maps and 5 reducers
> 2) restart Resource manager when 60 maps and 0 reducers are finished
> 3) Wait for job to come out of PREP state.
> The job does not come out of PREP state after 7-8 mins.
> After waiting for 7-8 mins, test kills the job.
> However, Sort job should not take this long time to come out of PREP state



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1815) RM should recover only Managed AMs

2014-03-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932232#comment-13932232
 ] 

Karthik Kambatla commented on YARN-1815:


bq. If so, we need to make sure the App state moves to FAILED for apps with 
unmanaged AMs after RM restart.
Good catch, Vinod. Looking into this. 

> RM should recover only Managed AMs
> --
>
> Key: YARN-1815
> URL: https://issues.apache.org/jira/browse/YARN-1815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: yarn-1815-1.patch
>
>
> RM should not recover unmanaged AMs until YARN-1823 is fixed. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1813) Better error message for "yarn logs" when permission denied

2014-03-12 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932252#comment-13932252
 ] 

Andrew Wang commented on YARN-1813:
---

Maybe we can mock an appropriate AbstractFileSystem?
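
One possible (untested) shape for such a test, with a hypothetical describeLogDir() helper standing in for the real log-reader path and a plain FileSystem mock standing in for the AbstractFileSystem idea:
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.AccessControlException;
import org.junit.Assert;
import org.junit.Test;

public class TestPermissionDeniedMessageSketch {

  // Hypothetical helper standing in for the "yarn logs" lookup path.
  static String describeLogDir(FileSystem fs, Path dir) {
    try {
      fs.listStatus(dir);
      return "Logs found at " + dir;
    } catch (AccessControlException e) {
      // The behavior this JIRA asks for: surface the real cause.
      return "Permission denied while reading " + dir;
    } catch (IOException e) {
      return "Log aggregation has not completed or is not enabled.";
    }
  }

  @Test
  public void testPermissionDeniedIsReported() throws Exception {
    Path remoteLogDir =
        new Path("/tmp/logs/andrew.wang/logs/application_1394482121761_0010");
    FileSystem fs = mock(FileSystem.class);
    when(fs.listStatus(remoteLogDir))
        .thenThrow(new AccessControlException("Permission denied"));

    Assert.assertTrue(
        describeLogDir(fs, remoteLogDir).startsWith("Permission denied"));
  }
}
{code}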

> Better error message for "yarn logs" when permission denied
> ---
>
> Key: YARN-1813
> URL: https://issues.apache.org/jira/browse/YARN-1813
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.3.0
>Reporter: Andrew Wang
>Assignee: Tsuyoshi OZAWA
>Priority: Minor
> Attachments: YARN-1813.1.patch
>
>
> I ran some MR jobs as the "hdfs" user, and then forgot to sudo -u when 
> grabbing the logs. "yarn logs" prints an error message like the following:
> {noformat}
> [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010
> 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at 
> a2402.halxg.cloudera.com/10.20.212.10:8032
> Logs not available at 
> /tmp/logs/andrew.wang/logs/application_1394482121761_0010
> Log aggregation has not completed or is not enabled.
> {noformat}
> It'd be nicer if it said "Permission denied" or "AccessControlException" or 
> something like that instead, since that's the real issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1816) Succeeded application remains in accepted

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932272#comment-13932272
 ] 

Hadoop QA commented on YARN-1816:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634215/YARN-1816.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build///testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build///console

This message is automatically generated.

> Succeeded application remains in accepted
> -
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1816.1.patch
>
>

[jira] [Commented] (YARN-1474) Make schedulers services

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932280#comment-13932280
 ] 

Hadoop QA commented on YARN-1474:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634209/YARN-1474.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 10 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-tools/hadoop-sls 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3334//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3334//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3334//console

This message is automatically generated.

> Make schedulers services
> 
>
> Key: YARN-1474
> URL: https://issues.apache.org/jira/browse/YARN-1474
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.3.0
>Reporter: Sandy Ryza
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
> YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch
>
>
> Schedulers currently have a reinitialize but no start and stop.  Fitting them 
> into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1816) Succeeded application remains in accepted

2014-03-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932282#comment-13932282
 ] 

Jian He commented on YARN-1816:
---

The patch changed RMApp to handle the AttemptFinished event at the ACCEPTED state, 
which may occur on recovery.
It also made one more change to skip adding the app into the scheduler if the final 
state of the last attempt is not null, meaning the attempt was already completed.
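
In sketch form (the type and method names below are illustrative placeholders, not the patch's actual ones), the second change amounts to:
{code}
public class AttemptRecoverySketch {
  // Minimal stand-ins for the real RM types.
  interface Scheduler { void addApplicationAttempt(String attemptId); }

  static class RecoveredAttemptData {
    final String attemptId;
    final String finalState;   // null => attempt had not completed yet
    RecoveredAttemptData(String attemptId, String finalState) {
      this.attemptId = attemptId;
      this.finalState = finalState;
    }
  }

  // Only hand the attempt back to the scheduler if it was still live when the
  // RM went down; a completed attempt just skips the scheduler entirely.
  static void recover(RecoveredAttemptData data, Scheduler scheduler) {
    if (data.finalState != null) {
      System.out.println("Attempt " + data.attemptId
          + " already finished with " + data.finalState + "; skip scheduler");
      return;
    }
    scheduler.addApplicationAttempt(data.attemptId);
  }
}
{code}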

> Succeeded application remains in accepted
> -
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1816.1.patch
>
>

[jira] [Updated] (YARN-1811) RM HA: AM link broken if the AM is on nodes other than RM

2014-03-12 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-1811:


Attachment: YARN-1811.patch

New patch addresses Vinod's comments and adds/updates the tests accordingly.  I 
also tried it out against RM HA and against a single RM.

> RM HA: AM link broken if the AM is on nodes other than RM
> -
>
> Key: YARN-1811
> URL: https://issues.apache.org/jira/browse/YARN-1811
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-1811.patch, YARN-1811.patch, YARN-1811.patch
>
>
> When using RM HA, if you click on the "Application Master" link in the RM web 
> UI while the job is running, you get an Error 500:



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1815) RM should recover only Managed AMs

2014-03-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1815:
---

Attachment: yarn-1815-2.patch

The updated patch sets the final state of unmanaged AMs on recovery to FAILED 
directly.
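
Roughly, the branch being described (types and names here are placeholders for illustration):
{code}
public class UnmanagedAmRecoverySketch {
  enum RMAppState { ACCEPTED, FAILED }

  static class StoredApp {
    final String appId;
    final boolean unmanagedAM;
    StoredApp(String appId, boolean unmanagedAM) {
      this.appId = appId;
      this.unmanagedAM = unmanagedAM;
    }
  }

  // Until unmanaged-AM recovery is supported (YARN-1823), an unmanaged app
  // read back from the state store is marked FAILED instead of being run
  // through the normal recovery path.
  static RMAppState recover(StoredApp app) {
    if (app.unmanagedAM) {
      return RMAppState.FAILED;
    }
    return RMAppState.ACCEPTED;   // managed apps continue normal recovery
  }
}
{code}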

> RM should recover only Managed AMs
> --
>
> Key: YARN-1815
> URL: https://issues.apache.org/jira/browse/YARN-1815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: yarn-1815-1.patch, yarn-1815-2.patch
>
>
> RM should not recover unmanaged AMs until YARN-1823 is fixed. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1813) Better error message for "yarn logs" when permission denied

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932315#comment-13932315
 ] 

Tsuyoshi OZAWA commented on YARN-1813:
--

I see, trying it.

> Better error message for "yarn logs" when permission denied
> ---
>
> Key: YARN-1813
> URL: https://issues.apache.org/jira/browse/YARN-1813
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.3.0
>Reporter: Andrew Wang
>Assignee: Tsuyoshi OZAWA
>Priority: Minor
> Attachments: YARN-1813.1.patch
>
>
> I ran some MR jobs as the "hdfs" user, and then forgot to sudo -u when 
> grabbing the logs. "yarn logs" prints an error message like the following:
> {noformat}
> [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010
> 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at 
> a2402.halxg.cloudera.com/10.20.212.10:8032
> Logs not available at 
> /tmp/logs/andrew.wang/logs/application_1394482121761_0010
> Log aggregation has not completed or is not enabled.
> {noformat}
> It'd be nicer if it said "Permission denied" or "AccessControlException" or 
> something like that instead, since that's the real issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1512) Enhance CS to decouple scheduling from node heartbeats

2014-03-12 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated YARN-1512:


Attachment: YARN-1512.patch

Updated patch.

> Enhance CS to decouple scheduling from node heartbeats
> --
>
> Key: YARN-1512
> URL: https://issues.apache.org/jira/browse/YARN-1512
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
> Attachments: YARN-1512.patch, YARN-1512.patch
>
>
> Enhance CS to decouple scheduling from node heartbeats; a prototype has 
> improved latency significantly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1512) Enhance CS to decouple scheduling from node heartbeats

2014-03-12 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932324#comment-13932324
 ] 

Arun C Murthy commented on YARN-1512:
-

Ready for review; with this patch I've benchmarked <1ms per allocation on a 
cluster now. 

> Enhance CS to decouple scheduling from node heartbeats
> --
>
> Key: YARN-1512
> URL: https://issues.apache.org/jira/browse/YARN-1512
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
> Attachments: YARN-1512.patch, YARN-1512.patch
>
>
> Enhance CS to decouple scheduling from node heartbeats; a prototype has 
> improved latency significantly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: apache-yarn-90.1.patch

Uploaded new patch.
{quote}
DirectoryCollection: can you put the block where you create and delete a 
random directory inside a dir.exists() check? We don't want to create-delete a 
directory that already exists but matches with our random string - very 
unlikely but not impossible.
{quote}
Fixed. The dir check is now its own function with the exists check.
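
One way such a check can look (directory naming and the probe itself are illustrative only):
{code}
import java.io.File;
import java.util.UUID;

public class DiskWriteCheckSketch {
  // Probe whether a local dir is writable by creating and removing a random
  // subdirectory, never touching a path that already exists.
  static boolean verifyDirUsingMkdir(File dir) {
    File probe;
    do {
      // Keep generating names until one does not already exist, so we never
      // create-delete a directory that happens to belong to someone else.
      probe = new File(dir, "disk-check-" + UUID.randomUUID());
    } while (probe.exists());
    try {
      return probe.mkdir();
    } finally {
      probe.delete();
    }
  }

  public static void main(String[] args) {
    System.out.println(verifyDirUsingMkdir(new File("/tmp")));
  }
}
{code}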

{quote}
ResourceLocalizationService (RLS): What happens to disks that become good 
after service-init? We don't create the top level directories there. Depending 
on our assumptions in the code in the remaining NM subsystem, this may or may 
not lead to bad bugs. Should we permanently exclude bad-disks found during 
initializing?
Similarly in RLS, at service-init, we cleanUpLocalDir() to delete old files. If 
disks become good again, we will have unclean disks. And depending on our 
assumptions, we may or may not run into issues. For e.g, files 'leaked' like 
that may never get deleted.
{quote}
Fixed. Local and log dirs undergo a check before use to ensure that they have 
been set up correctly.

{quote}
Add comments to all the tests describing what is being tested
{quote}
Fixed

{quote}
Add more inline comments for each test-block, say for e.g. "changing a disk 
to be bad" before a blocker where you change permissions. For readability.
{quote}
Fixed

{quote}
In all the tests where you sleep for a time more than disk-checker 
frequency, it may or may not pass the test depending on the underlying thread 
scheduling. Instead of that, you should explicitly call 
LocalDirsHandlerService.checkDirs()
{quote}
Fixed, used mocks of the LocalDirsHandlerService removing the timing issue.
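
For illustration, a mocked handler could look like this, assuming the handler exposes its current good-dir list via getLocalDirs():
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;

public class MockDirsHandlerSketch {
  // Hand the code under test a mocked handler whose view of the good dirs the
  // test controls directly, instead of sleeping past the disk-checker interval
  // and hoping the background check has already run.
  static LocalDirsHandlerService handlerWithGoodDirs(List<String> goodLocalDirs) {
    LocalDirsHandlerService handler = mock(LocalDirsHandlerService.class);
    when(handler.getLocalDirs()).thenReturn(goodLocalDirs);
    return handler;
  }

  public static void main(String[] args) {
    LocalDirsHandlerService handler =
        handlerWithGoodDirs(Arrays.asList("/grid/0/yarn/local"));
    System.out.println(handler.getLocalDirs());
  }
}
{code}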

{quote}
TestResourceLocalizationService.testFailedDirsResourceRelease()
Nonstandard formatting in method declaration
There is a bit of code about creating container-dirs. Can we reuse some 
of it from ContainerLocalizer?
{quote}
Fixed the non-standard formatting. The ContainerLocalizer code creates only the 
usercache (we need the filecache and the nmPrivate dirs as well).

{quote}
TestNonAggregatingLogHandler
In the existing test-case, you have "actually create the dirs". Why is 
that needed?
{quote}
Fixed. Used mocking to remove requirement.

{quote}
Can we reuse any code in this test with what exists in 
TestLogAggregationService? Seems to me that they both should mostly be the same.
{quote}
Fixed. Shared code moved into functions.

{quote}
TestDirectoryCollection.testFailedDirPassingCheck -> 
testFailedDisksBecomingGoodAgain
{quote}
Fixed.



> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs a restart. This JIRA is to improve NodeManager to 
> reuse good disks (which could have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: (was: apache-yarn-90.1.patch)

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs a restart. This JIRA is to improve NodeManager to 
> reuse good disks (which could have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: apache-yarn-90.1.patch

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs a restart. This JIRA is to improve NodeManager to 
> reuse good disks (which could have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1816) Succeeded application remains in accepted

2014-03-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932392#comment-13932392
 ] 

Vinod Kumar Vavilapalli commented on YARN-1816:
---

Looks good. +1. Checking this in.

> Succeeded application remains in accepted
> -
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1816.1.patch
>
>

[jira] [Updated] (YARN-1816) Succeeded application remains in accepted after RM restart

2014-03-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1816:
--

Summary: Succeeded application remains in accepted after RM restart  (was: 
Succeeded application remains in accepted)

> Succeeded application remains in accepted after RM restart
> --
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1816.1.patch
>
>

[jira] [Commented] (YARN-1717) Enable offline deletion of entries in leveldb timeline store

2014-03-12 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932412#comment-13932412
 ] 

Zhijie Shen commented on YARN-1717:
---

Thanks for the patch, Billie! Here're some more comments:

1. For the newly added configurations, add the corresponding ones in 
yarn-default.xml.

2. Should these aging-mechanism-related configs have a leveldb section in the 
config name, since they're only related to the leveldb impl?

3. Let's not delete super.serviceInit(conf); put it last instead? (See the 
sketch after this list.)
{code}
-super.serviceInit(conf);
+
+if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_TTL_ENABLE, true)) {
+  deletionThread = new DeletionThread(conf);
+  deletionThread.start();
+}
{code}

4. Interrupt, and then join the thread. See the other examples in the project, 
and the sketch after this list.
{code}
+  deletionThread.interrupt();
+  while (deletionThread.isAlive()) {
+try {
+  LOG.info("Waiting for deletion thread to complete its current " +
+  "action");
+  Thread.sleep(1000);
+} catch (InterruptedException e) {}
+  }
+}
{code}

5. It doesn't seem necessary to refactor "getEntity" into two methods, does it?
{code}
   @Override
   public TimelineEntity getEntity(String entityId, String entityType,
   EnumSet fields) throws IOException {
{code}
{code}
+  private EntityWithReverseRelatedEntities getEntity(String entityId,
+  String entityType, EnumSet fields, byte[] revStartTime)
+  throws IOException {
{code}

6. There are 4 warnings in the new LeveldbTimelineStore.java. Would you please 
fix them?

7. In discardOldEntities, if one IOException happens, is it good to move on 
with the following discarding operations?
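
For items 3 and 4, here is a minimal sketch of the suggested service lifecycle. 
The field and constant names follow the posted diff; everything else is an 
assumption and not the actual patch:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  // ... existing leveldb store initialization ...
  if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_TTL_ENABLE, true)) {
    deletionThread = new DeletionThread(conf);
    deletionThread.start();
  }
  // Call the parent last, as suggested in item 3.
  super.serviceInit(conf);
}

@Override
protected void serviceStop() throws Exception {
  if (deletionThread != null) {
    // Item 4: interrupt the deletion thread and then join it, instead of
    // polling isAlive() in a sleep loop.
    deletionThread.interrupt();
    try {
      deletionThread.join();
    } catch (InterruptedException e) {
      // Restore the interrupt status and continue shutting down.
      Thread.currentThread().interrupt();
    }
  }
  super.serviceStop();
}
{code}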

> Enable offline deletion of entries in leveldb timeline store
> 
>
> Key: YARN-1717
> URL: https://issues.apache.org/jira/browse/YARN-1717
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Attachments: YARN-1717.1.patch, YARN-1717.2.patch, YARN-1717.3.patch, 
> YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, 
> YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch
>
>
> The leveldb timeline store implementation needs the following:
> * better documentation of its internal structures
> * internal changes to enable deleting entities
> ** never overwrite existing primary filter entries
> ** add hidden reverse pointers to related entities



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1815) RM should recover only Managed AMs

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932419#comment-13932419
 ] 

Hadoop QA commented on YARN-1815:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634246/yarn-1815-2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3335//console

This message is automatically generated.

> RM should recover only Managed AMs
> --
>
> Key: YARN-1815
> URL: https://issues.apache.org/jira/browse/YARN-1815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: yarn-1815-1.patch, yarn-1815-2.patch
>
>
> RM should not recover unmanaged AMs until YARN-1823 is fixed. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1812) Job stays in PREP state for long time after RM Restarts

2014-03-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1812:
---

Summary: Job stays in PREP state for long time after RM Restarts  (was: Job 
stays in PREP state for log time after RM Restarts)

> Job stays in PREP state for long time after RM Restarts
> ---
>
> Key: YARN-1812
> URL: https://issues.apache.org/jira/browse/YARN-1812
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Yesha Vora
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1812.1.patch, YARN-1812.2.patch, YARN-1812.3.patch
>
>
> Steps followed:
> 1) start a sort job with 80 maps and 5 reducers
> 2) restart Resource manager when 60 maps and 0 reducers are finished
> 3) Wait for job to come out of PREP state.
> The job does not come out of PREP state after 7-8 mins.
> After waiting for 7-8 mins, test kills the job.
> However, Sort job should not take this long time to come out of PREP state



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1717) Enable offline deletion of entries in leveldb timeline store

2014-03-12 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932437#comment-13932437
 ] 

Zhijie Shen commented on YARN-1717:
---

Once the aging mechanism is in place, some use cases may be affected. I can 
come up with two:

1. Say I have a long-running service, and it records each container it has 
launched as an "entity". If the service runs long enough, the deletion thread 
will already have discarded the entity data of the containers that finished 
early. Then, if users query by the entity type "Container", the old containers 
will no longer show up, and users may be misled into thinking the returned 
list contains all the containers.

2. Another case is related, but a bit different. Suppose the service marks all 
of its containers as related to the service entity. If I query that entity, I 
expect to get all the containers in the related-entities section. However, if 
some containers have been discarded, the section will in fact hold an 
incomplete list.

To some extent, it makes sense to discard entities at some granularity. For 
example, w.r.t. the generic history information, it should be fine to delete 
the history file that contains all the information about an application, 
including its attempts and containers. However, if we only delete parts, say 
the containers, users will be confused into thinking the application never 
launched any containers.

I'm just thinking out loud, in case the aging mechanism affects some usage of 
the timeline service.

> Enable offline deletion of entries in leveldb timeline store
> 
>
> Key: YARN-1717
> URL: https://issues.apache.org/jira/browse/YARN-1717
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Attachments: YARN-1717.1.patch, YARN-1717.2.patch, YARN-1717.3.patch, 
> YARN-1717.4.patch, YARN-1717.5.patch, YARN-1717.6-extra.patch, 
> YARN-1717.6.patch, YARN-1717.7.patch, YARN-1717.8.patch, YARN-1717.9.patch
>
>
> The leveldb timeline store implementation needs the following:
> * better documentation of its internal structures
> * internal changes to enable deleting entities
> ** never overwrite existing primary filter entries
> ** add hidden reverse pointers to related entities



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1815) RM should recover only Managed AMs

2014-03-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1815:
---

Attachment: yarn-1815-2.patch

Rebased patch on trunk (previous one had conflicts with YARN-1816). 

> RM should recover only Managed AMs
> --
>
> Key: YARN-1815
> URL: https://issues.apache.org/jira/browse/YARN-1815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: Unmanaged AM recovery.png, yarn-1815-1.patch, 
> yarn-1815-2.patch, yarn-1815-2.patch
>
>
> RM should not recover unmanaged AMs until YARN-1823 is fixed. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource

2014-03-12 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932441#comment-13932441
 ] 

Sangjin Lee commented on YARN-1771:
---

Good point about making isPublic() package-private. I'll make that change.

I'm also going to make changes to memoize failures. My only slight hesitation 
is that such failures would normally be quite rare, but I think it's a good 
thing to have.

I did think about making the stat cache longer-lived, but the complexity of 
managing its size, as well as the values getting quite stale, dissuaded me 
from it. Let me know if you agree...

I'll post a new patch soon.

> many getFileStatus calls made from node manager for localizing a public 
> distributed cache resource
> --
>
> Key: YARN-1771
> URL: https://issues.apache.org/jira/browse/YARN-1771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>Priority: Critical
> Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch
>
>
> We're observing that the getFileStatus calls are putting a fair amount of 
> load on the name node as part of checking the public-ness for localizing a 
> resource that belong in the public cache.
> We see 7 getFileStatus calls made for each of these resource. We should look 
> into reducing the number of calls to the name node. One example:
> {noformat}
> 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   src=/tmp ...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   src=/...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,355 INFO audit: ... cmd=open  
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1815) RM should recover only Managed AMs

2014-03-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1815:
---

Attachment: Unmanaged AM recovery.png

Attaching screen shot with the fix. 

> RM should recover only Managed AMs
> --
>
> Key: YARN-1815
> URL: https://issues.apache.org/jira/browse/YARN-1815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: Unmanaged AM recovery.png, yarn-1815-1.patch, 
> yarn-1815-2.patch, yarn-1815-2.patch
>
>
> RM should not recover unmanaged AMs until YARN-1823 is fixed. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1816) Succeeded application remains in accepted after RM restart

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932521#comment-13932521
 ] 

Hudson commented on YARN-1816:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5312 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5312/])
YARN-1816. Fixed ResourceManager to get RMApp correctly handle ATTEMPT_FINISHED 
event at ACCEPTED state that can happen after RM restarts. Contributed by Jian 
He. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576911)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java


> Succeeded application remains in accepted after RM restart
> --
>
> Key: YARN-1816
> URL: https://issues.apache.org/jira/browse/YARN-1816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Jian He
> Fix For: 2.4.0
>
> Attachments: YARN-1816.1.patch
>
>
> {code}
> 2014-03-10 18:07:31,944|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:07:31,945|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:08:02,125|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:08:03,198|beaver.machine|INFO|14/03/10 18:08:03 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:08:03,238|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:08:03,239|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:08:03,239|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:08:33,390|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:08:34,437|beaver.machine|INFO|14/03/10 18:08:34 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:08:34,477|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:08:34,477|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State Final-State Progress 
>Tracking-URL
> 2014-03-10 18:08:34,478|beaver.machine|INFO|application_1394449508064_0008
> test_mapred_ha_multiple_job_nn-rm-1-min-5-jobs_1394449960-4
> MAPREDUCEhrt_qa defaultACCEPTED   
> SUCCEEDED 100% 
> http://hostname:19888/jobhistory/job/job_1394449508064_0008
> 2014-03-10 18:09:04,628|beaver.machine|INFO|RUNNING: /usr/bin/yarn 
> application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING
> 2014-03-10 18:09:05,688|beaver.machine|INFO|14/03/10 18:09:05 INFO 
> client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2014-03-10 18:09:05,728|beaver.machine|INFO|Total number of applications 
> (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, 
> RUNNING]):1
> 2014-03-10 18:09:05,728|beaver.machine|INFO|Application-Id
> Application-NameApplication-Type  User   Queue
>State F

[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932539#comment-13932539
 ] 

Hadoop QA commented on YARN-90:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12634255/apache-yarn-90.1.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3337//console

This message is automatically generated.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1789) ApplicationSummary does not escape newlines in the app name

2014-03-12 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932541#comment-13932541
 ] 

Jason Lowe commented on YARN-1789:
--

+1 lgtm.  Committing this.

> ApplicationSummary does not escape newlines in the app name
> ---
>
> Key: YARN-1789
> URL: https://issues.apache.org/jira/browse/YARN-1789
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Akira AJISAKA
>Assignee: Tsuyoshi OZAWA
>Priority: Minor
>  Labels: newbie
> Attachments: YARN-1789.1.patch
>
>
> YARN-side of MAPREDUCE-5778.
> ApplicationSummary is not escaping newlines in the app name. This can result 
> in an application summary log entry that spans multiple lines when users are 
> expecting one-app-per-line output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1789) ApplicationSummary does not escape newlines in the app name

2014-03-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932602#comment-13932602
 ] 

Hudson commented on YARN-1789:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5313 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5313/])
YARN-1789. ApplicationSummary does not escape newlines in the app name. 
Contributed by Tsuyoshi OZAWA (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1576960)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAppManager.java


> ApplicationSummary does not escape newlines in the app name
> ---
>
> Key: YARN-1789
> URL: https://issues.apache.org/jira/browse/YARN-1789
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Akira AJISAKA
>Assignee: Tsuyoshi OZAWA
>Priority: Minor
>  Labels: newbie
> Fix For: 2.4.0
>
> Attachments: YARN-1789.1.patch
>
>
> YARN-side of MAPREDUCE-5778.
> ApplicationSummary is not escaping newlines in the app name. This can result 
> in an application summary log entry that spans multiple lines when users are 
> expecting one-app-per-line output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-1389:


Attachment: YARN-1389-9.patch

Updating latest patch after fixing exceptions.

Thanks,
Mayank

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch, 
> YARN-1389-8.patch, YARN-1389-9.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct the query of running instance 
> to ApplicationClientProtocol, while that of finished instance to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-03-12 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932623#comment-13932623
 ] 

Zhijie Shen commented on YARN-1530:
---

I've done some stress testing of the timeline service, in two steps:

1. I set up a YARN cluster of 5 nodes (1 master and 4 slaves) and ran 7200 
mapreduce example jobs with the tez framework (which posts tez entities to the 
timeline service) over 10 hours. The max number of concurrent jobs was 7, 
because the cluster was small with only 32G of memory, but the real 
concurrency should be higher because of the multiple mappers/reducers. The 
workload was kept almost full for the 10 hours. All the jobs succeeded, and 
the tez entities were stored in the timeline store without exceptions. The 
leveldb-based timeline store grew to about 220MB (not very big because the 
example jobs are small).

2. Afterwards, I tested concurrent reads and writes together. On the write 
side, I did the same thing as in step 1. On the read side, I set up 4 timeline 
query clients, one on each slave node. Each client started 10 parallel threads 
sending requests to the timeline service, also for 10 hours. Each client sent 
more than 6 million queries during the 10 hours using a combination of three 
RESTful APIs (24+ million total for the 4 clients). In general, the timeline 
service kept working well. I saw only one query answered with a not-found 
exception, plus some other JVM warnings. The get-entities query takes 0.X on 
average, while the get-entity/events queries take 0.0x.

So far, the timeline service with the leveldb store works well. I'll do more 
stress testing with big entities, and will update you once I have some 
metrics.

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --
>
> Key: YARN-1530
> URL: https://issues.apache.org/jira/browse/YARN-1530
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
> Attachments: application timeline design-20140108.pdf, application 
> timeline design-20140116.pdf, application timeline design-20140130.pdf, 
> application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to do store, and serve per-framework 
> data all by itself as YARN doesn't have a common solution. This JIRA attempts 
> to solve the storage, management and serving of per-framework data from 
> various applications, both running and finished. The aim is to change YARN to 
> collect and store data in a generic manner with plugin points for frameworks 
> to do their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1815) RM should recover only Managed AMs

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932630#comment-13932630
 ] 

Hadoop QA commented on YARN-1815:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634273/yarn-1815-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3338//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3338//console

This message is automatically generated.

> RM should recover only Managed AMs
> --
>
> Key: YARN-1815
> URL: https://issues.apache.org/jira/browse/YARN-1815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: Unmanaged AM recovery.png, yarn-1815-1.patch, 
> yarn-1815-2.patch, yarn-1815-2.patch
>
>
> RM should not recover unmanaged AMs until YARN-1823 is fixed. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1512) Enhance CS to decouple scheduling from node heartbeats

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932639#comment-13932639
 ] 

Hadoop QA commented on YARN-1512:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634248/YARN-1512.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceManager
  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
  org.apache.hadoop.yarn.server.resourcemanager.TestRM
  
org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3336//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3336//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3336//console

This message is automatically generated.

> Enhance CS to decouple scheduling from node heartbeats
> --
>
> Key: YARN-1512
> URL: https://issues.apache.org/jira/browse/YARN-1512
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
> Attachments: YARN-1512.patch, YARN-1512.patch
>
>
> Enhance CS to decouple scheduling from node heartbeats; a prototype has 
> improved latency significantly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource

2014-03-12 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932651#comment-13932651
 ] 

Chris Douglas commented on YARN-1771:
-

bq. I'm also going to make changes to memoize failures. The only slight 
hesitation I have is normally that would be quite rare, but I think it's a good 
thing to have.

Agreed, I doubt it will have a significant impact, here. In a 
shared/longer-lived cache it might be marginally more useful, but still rare.

bq. I did think about making the stat cache longer-lived. But the complexity of 
managing its size as well as the values getting quite stale dissuaded me from 
it. Let me know if you agree...

*nod* Since the goal is to reduce stress on the NN, deferring that complexity 
until necessary is a good plan.

> many getFileStatus calls made from node manager for localizing a public 
> distributed cache resource
> --
>
> Key: YARN-1771
> URL: https://issues.apache.org/jira/browse/YARN-1771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>Priority: Critical
> Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch
>
>
> We're observing that the getFileStatus calls are putting a fair amount of 
> load on the name node as part of checking the public-ness for localizing a 
> resource that belong in the public cache.
> We see 7 getFileStatus calls made for each of these resource. We should look 
> into reducing the number of calls to the name node. One example:
> {noformat}
> 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   src=/tmp ...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   src=/...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,355 INFO audit: ... cmd=open  
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1827) yarn client fails when RM is killed within 5s of job submission

2014-03-12 Thread Arpit Gupta (JIRA)
Arpit Gupta created YARN-1827:
-

 Summary: yarn client fails when RM is killed within 5s of job 
submission
 Key: YARN-1827
 URL: https://issues.apache.org/jira/browse/YARN-1827
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Arpit Gupta






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1827) yarn client fails when RM is killed within 5s of job submission

2014-03-12 Thread Arpit Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932715#comment-13932715
 ] 

Arpit Gupta commented on YARN-1827:
---

Here is the stack trace we see

{code}
/usr/lib/hadoop/bin/hadoop jar 
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.4.0.2.1.1.0-180.jar 
wordcount "-Dmapreduce.reduce.input.limit=-1" 
/user/hrt_qa/test_yarn_ha/medium_wordcount_input 
/user/hrt_qa/test_yarn_ha/test_mapred_ha_single_job_am-rm
INFO|Initial wait for Service resourcemanager: 5
14/03/12 10:41:34 WARN hdfs.DFSClient: 
dfs.client.test.drop.namenode.response.number is set to 1, this hacked client 
will proactively drop responses
14/03/12 10:41:34 WARN hdfs.DFSClient: 
dfs.client.test.drop.namenode.response.number is set to 1, this hacked client 
will proactively drop responses
14/03/12 10:41:38 INFO input.FileInputFormat: Total input paths to process : 20
14/03/12 10:41:38 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/03/12 10:41:38 INFO lzo.LzoCodec: Successfully loaded & initialized 
native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
INFO|stop resourcemanager
RUNNING: sudo su - -c "/usr/bin/yarn rmadmin -getServiceState rm1" yarn
14/03/12 10:41:38 INFO mapreduce.JobSubmitter: number of splits:180
14/03/12 10:41:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_1394620620060_0001
active
oop|INFO|exit code = 0
oop|INFO|Kill service resourcemanager on host host
RUNNING: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null host 
"sudo su - -c \"cat /grid/0/var/run/hadoop/yarn/yarn-yarn-resourcemanager.pid | 
xargs kill -9\" yarn"
Warning: Permanently added 'host,68.142.247.212' (RSA) to the list of known 
hosts.
14/03/12 10:41:39 WARN retry.RetryInvocationHandler: Exception while invoking 
class 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport
 over rm1. Not retrying because the invoked method is not idempotent, and 
unable to determine whether it was invoked
java.io.IOException: Failed on local exception: java.io.EOFException; Host 
Details : local host is: "host":8032;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1359)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy14.getApplicationReport(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:142)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at $Proxy15.getApplicationReport(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:275)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:192)
at 
org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:282)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:289)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at o

[jira] [Updated] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource

2014-03-12 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1771:
--

Attachment: yarn-1771.patch

I have submitted a new patch.

I'm now using Future as the value in the loading cache so that it can memoize 
negative results. I think that's the best way to achieve this while still 
reporting the original exceptions. I thought about using Optional, but Future 
works better in this case (as it can hold either the value or the exception). 
The code becomes slightly more complicated as a result.

I also moved the code that creates the CacheLoader into FSDownload so that the 
related logic is co-located within FSDownload.

Also made isPublic package-private.
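
To illustrate the idea of memoizing negative results with a Future-valued 
cache, here is a self-contained sketch. It is not the actual FSDownload code; 
the FileSystem handle and the Path key are assumptions:
{code}
import java.io.IOException;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.SettableFuture;

public class StatCacheSketch {
  // Caching a Future means that both a successful FileStatus and an
  // IOException from the single getFileStatus call are remembered and
  // replayed to every later caller, so the NameNode is hit only once
  // per path, even for paths that fail.
  static LoadingCache<Path, Future<FileStatus>> createStatCache(final FileSystem fs) {
    return CacheBuilder.newBuilder().build(
        new CacheLoader<Path, Future<FileStatus>>() {
          @Override
          public Future<FileStatus> load(Path path) {
            SettableFuture<FileStatus> result = SettableFuture.create();
            try {
              result.set(fs.getFileStatus(path));
            } catch (IOException e) {
              // Memoize the failure as well.
              result.setException(e);
            }
            return result;
          }
        });
  }
}
{code}
A caller would get the Future from the cache and, on ExecutionException, 
rethrow the cause so the original IOException is reported.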

> many getFileStatus calls made from node manager for localizing a public 
> distributed cache resource
> --
>
> Key: YARN-1771
> URL: https://issues.apache.org/jira/browse/YARN-1771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>Priority: Critical
> Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, 
> yarn-1771.patch
>
>
> We're observing that the getFileStatus calls are putting a fair amount of 
> load on the name node as part of checking the public-ness for localizing a 
> resource that belong in the public cache.
> We see 7 getFileStatus calls made for each of these resource. We should look 
> into reducing the number of calls to the name node. One example:
> {noformat}
> 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   src=/tmp ...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   src=/...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,355 INFO audit: ... cmd=open  
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1389:
--

Attachment: YARN-1389.10.patch

Mayank, thanks for the new patch. It looks good to me. I did some minor 
touch-ups to remove the unused import, simplify the code path in 
YarnClientImpl, and fix some remaining exception issues. I also did some local 
testing on a pseudo cluster, which worked fine. Will commit the patch once 
Jenkins +1s.

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch, 
> YARN-1389-8.patch, YARN-1389-9.patch, YARN-1389.10.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct the query of running instance 
> to ApplicationClientProtocol, while that of finished instance to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1389:
--

Hadoop Flags: Reviewed

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch, 
> YARN-1389-8.patch, YARN-1389-9.patch, YARN-1389.10.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct the query of running instance 
> to ApplicationClientProtocol, while that of finished instance to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932806#comment-13932806
 ] 

Hadoop QA commented on YARN-1771:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634327/yarn-1771.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3340//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3340//console

This message is automatically generated.

> many getFileStatus calls made from node manager for localizing a public 
> distributed cache resource
> --
>
> Key: YARN-1771
> URL: https://issues.apache.org/jira/browse/YARN-1771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>Priority: Critical
> Attachments: yarn-1771.patch, yarn-1771.patch, yarn-1771.patch, 
> yarn-1771.patch
>
>
> We're observing that the getFileStatus calls are putting a fair amount of 
> load on the name node as part of checking the public-ness for localizing a 
> resource that belong in the public cache.
> We see 7 getFileStatus calls made for each of these resource. We should look 
> into reducing the number of calls to the name node. One example:
> {noformat}
> 2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724 ...
> 2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo   src=/tmp ...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   src=/...
> 2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo   
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> 2014-02-27 18:07:27,355 INFO audit: ... cmd=open  
> src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932865#comment-13932865
 ] 

Hadoop QA commented on YARN-1389:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634332/YARN-1389.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3341//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3341//console

This message is automatically generated.

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch, 
> YARN-1389-8.patch, YARN-1389-9.patch, YARN-1389.10.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct the query of running instance 
> to ApplicationClientProtocol, while that of finished instance to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs

2014-03-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1389:
--

Attachment: YARN-1389.11.patch

Upload a new patch, which should fix the findbugs warning.

> ApplicationClientProtocol and ApplicationHistoryProtocol should expose 
> analogous APIs
> -
>
> Key: YARN-1389
> URL: https://issues.apache.org/jira/browse/YARN-1389
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, 
> YARN-1389-4.patch, YARN-1389-5.patch, YARN-1389-6.patch, YARN-1389-7.patch, 
> YARN-1389-8.patch, YARN-1389-9.patch, YARN-1389.10.patch, YARN-1389.11.patch
>
>
> As we plan to have the APIs in ApplicationHistoryProtocol to expose the 
> reports of *finished* application attempts and containers, we should do the 
> same for ApplicationClientProtocol, which will return the reports of 
> *running* attempts and containers.
> Later on, we can improve YarnClient to direct the query of running instance 
> to ApplicationClientProtocol, while that of finished instance to 
> ApplicationHistoryProtocol, making it transparent to the users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1591:
-

Attachment: YARN-1591.3.patch

* Fixed to call DefaultMetricsSystem.shutdown() only when 
ms.getSource("ClusterMetrics") returns null
* An unhandled exception randomly crashes the AsyncDispatcher thread, so I 
fixed that as well. The log is as follows:

{code}
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher.handle(ResourceManager.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher.handle(ResourceManager.java:539)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:744)
{code}

> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch, YARN-1591.3.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932890#comment-13932890
 ] 

Tsuyoshi OZAWA commented on YARN-1591:
--

typo:
* s/when ms.getSource("ClusterMetrics") returns null/when 
ms.getSource("ClusterMetrics") returns non null instance/



> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch, YARN-1591.3.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1828) Resource Manager is down when I request job on specific queue.

2014-03-12 Thread jerryjung (JIRA)
jerryjung created YARN-1828:
---

 Summary: Resource Manager is down when I request job on specific 
queue.
 Key: YARN-1828
 URL: https://issues.apache.org/jira/browse/YARN-1828
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.0
 Environment: CDH-5.0.0-beta-2(Hadoop 2.2.0) version
Reporter: jerryjung
Priority: Blocker


Resource Manager is down when I request job on specific queue.

jar 
~/servers/hadoop-2.2.0-cdh5.0.0-beta-2/share/hadoop/mapreduce/hadoop-maprede-examples-2.2.0-cdh5.0.0-beta-2.jar
 wordcount -Dmapreduce.job.queuename=jerry  


YARN log is below. 
==
2014-03-13 14:35:50,391 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
QueueName :::root.jerry
2014-03-13 14:35:50,391 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Queue Information :::null
2014-03-13 14:35:50,392 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:671)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:607)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1102)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:543)
at java.lang.Thread.run(Thread.java:744)
2014-03-13 14:35:50,394 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

I think the queue returned here is null:

queue = queueMgr.getLeafQueue(queueName, true); 

However, the queue information reported by the scheduler looks normal:

==
Queue Name : root.default 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, CurrentCapacity: 
0.0 
==
Queue Name : root.hadoop 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, CurrentCapacity: 
0.0 
==
Queue Name : root.hadoop.hadoop_sub_queue 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, 
CurrentCapacity: 0.0 
==
Queue Name : root.jerry 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, CurrentCapacity: 
0.0 
==
Queue Name : root.jerry.jerry_sub_queue 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, 
CurrentCapacity: 0.0 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1828) Resource Manager is down when I request job on specific queue.

2014-03-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932894#comment-13932894
 ] 

Karthik Kambatla commented on YARN-1828:


I believe this is YARN-1774 - submitting jobs to non-leaf queues would throw 
an NPE without the fix. Even after the fix, users are not allowed to submit 
jobs to a non-leaf queue; such submissions are rejected instead (see the 
sketch below). 
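
As an illustration of that behavior (this is not the actual YARN-1774 patch, 
and the class, field, and method names here are hypothetical rather than the 
FairScheduler internals), a minimal sketch of rejecting a submission whose 
queue name does not resolve to a leaf queue:

{code}
import java.util.HashMap;
import java.util.Map;

/** Sketch only: reject submissions to unknown or non-leaf queues instead of hitting an NPE. */
public class LeafQueueGuardSketch {
  // Stand-in for the scheduler's queue manager; only leaf queues are kept here.
  private final Map<String, Object> leafQueues = new HashMap<String, Object>();

  /** Returns true if the attempt is accepted, false if it is rejected. */
  public boolean addApplicationAttempt(String appId, String queueName) {
    Object queue = leafQueues.get(queueName); // stands in for queueMgr.getLeafQueue(queueName, ...)
    if (queue == null) {
      // Reject with a diagnostic instead of dereferencing null, which previously
      // killed the scheduler event dispatcher and brought the whole RM down.
      System.err.println("Rejecting " + appId + ": '" + queueName
          + "' does not resolve to a leaf queue");
      return false;
    }
    // ... proceed with normal scheduling against the resolved leaf queue ...
    return true;
  }
}
{code}

In the reported case, root.jerry has a child queue (root.jerry.jerry_sub_queue), 
so it is a parent rather than a leaf queue, which is why the lookup returned 
null.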

> Resource Manager is down when I request job on specific queue.
> --
>
> Key: YARN-1828
> URL: https://issues.apache.org/jira/browse/YARN-1828
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.2.0
> Environment: CDH-5.0.0-beta-2(Hadoop 2.2.0) version
>Reporter: jerryjung
>Priority: Blocker
>
> Resource Manager is down when I request job on specific queue.
> jar 
> ~/servers/hadoop-2.2.0-cdh5.0.0-beta-2/share/hadoop/mapreduce/hadoop-maprede-examples-2.2.0-cdh5.0.0-beta-2.jar
>  wordcount -Dmapreduce.job.queuename=jerry  
> YARN log is below. 
> ==
> 2014-03-13 14:35:50,391 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> QueueName :::root.jerry
> 2014-03-13 14:35:50,391 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Queue Information :::null
> 2014-03-13 14:35:50,392 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:607)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1102)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:543)
>   at java.lang.Thread.run(Thread.java:744)
> 2014-03-13 14:35:50,394 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> please fix the null pointer error about RM down for abnormal request.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1828) Resource Manager is down when I request job on specific queue.

2014-03-12 Thread jerryjung (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jerryjung updated YARN-1828:


Description: 
The ResourceManager goes down when I submit a job to a specific queue.

jar 
~/servers/hadoop-2.2.0-cdh5.0.0-beta-2/share/hadoop/mapreduce/hadoop-maprede-examples-2.2.0-cdh5.0.0-beta-2.jar
 wordcount -Dmapreduce.job.queuename=jerry  


The YARN log is below:
==
2014-03-13 14:35:50,391 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
QueueName :::root.jerry
2014-03-13 14:35:50,391 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Queue Information :::null
2014-03-13 14:35:50,392 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:671)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:607)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1102)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:543)
at java.lang.Thread.run(Thread.java:744)
2014-03-13 14:35:50,394 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

Please fix this NullPointerException so that an abnormal request does not 
bring the RM down.


  was:
The ResourceManager goes down when I submit a job to a specific queue.

jar 
~/servers/hadoop-2.2.0-cdh5.0.0-beta-2/share/hadoop/mapreduce/hadoop-maprede-examples-2.2.0-cdh5.0.0-beta-2.jar
 wordcount -Dmapreduce.job.queuename=jerry  


The YARN log is below:
==
2014-03-13 14:35:50,391 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
QueueName :::root.jerry
2014-03-13 14:35:50,391 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Queue Information :::null
2014-03-13 14:35:50,392 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:671)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:607)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1102)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:543)
at java.lang.Thread.run(Thread.java:744)
2014-03-13 14:35:50,394 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

I think the queue returned here is null:

queue = queueMgr.getLeafQueue(queueName, true); 

However, the queue information reported by the scheduler looks normal:

==
Queue Name : root.default 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, CurrentCapacity: 
0.0 
==
Queue Name : root.hadoop 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, CurrentCapacity: 
0.0 
==
Queue Name : root.hadoop.hadoop_sub_queue 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, 
CurrentCapacity: 0.0 
==
Queue Name : root.jerry 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, CurrentCapacity: 
0.0 
==
Queue Name : root.jerry.jerry_sub_queue 
Queue State : running 
Scheduling Info : Capacity: NaN, MaximumCapacity: UNDEFINED, 
CurrentCapacity: 0.0 


> Resource Manager is down when I request job on specific queue.
> --
>
> Key: YARN-1828
> URL: https://issues.apache.org/jira/browse/YARN-1828
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.2.0
> Environment: CDH-5.0.0-beta-2(Hadoop 2.2.0) version
>Reporter: jerryjung
>Priority: Blocker
>
> Resource Manager is down when I request job on specific queue.
> jar 
> ~/servers/hadoop-2.2.0-cdh5.0.0-beta-2/share/hadoop/mapreduce/hadoop-maprede-exampl

[jira] [Resolved] (YARN-1828) Resource Manager is down when I request job on specific queue.

2014-03-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla resolved YARN-1828.


Resolution: Duplicate

Closing this as a duplicate. We can re-open if we find any pending items. 

> Resource Manager is down when I request job on specific queue.
> --
>
> Key: YARN-1828
> URL: https://issues.apache.org/jira/browse/YARN-1828
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.2.0
> Environment: CDH-5.0.0-beta-2(Hadoop 2.2.0) version
>Reporter: jerryjung
>Priority: Blocker
>
> Resource Manager is down when I request job on specific queue.
> jar 
> ~/servers/hadoop-2.2.0-cdh5.0.0-beta-2/share/hadoop/mapreduce/hadoop-maprede-examples-2.2.0-cdh5.0.0-beta-2.jar
>  wordcount -Dmapreduce.job.queuename=jerry  
> YARN log is below. 
> ==
> 2014-03-13 14:35:50,391 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> QueueName :::root.jerry
> 2014-03-13 14:35:50,391 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Queue Information :::null
> 2014-03-13 14:35:50,392 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:607)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1102)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:543)
>   at java.lang.Thread.run(Thread.java:744)
> 2014-03-13 14:35:50,394 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> please fix the null pointer error about RM down for abnormal request.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1591) TestResourceTrackerService fails randomly on trunk

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932908#comment-13932908
 ] 

Hadoop QA commented on YARN-1591:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12634353/YARN-1591.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3343//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3343//console

This message is automatically generated.

> TestResourceTrackerService fails randomly on trunk
> --
>
> Key: YARN-1591
> URL: https://issues.apache.org/jira/browse/YARN-1591
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1591.1.patch, YARN-1591.2.patch, YARN-1591.3.patch
>
>
> As evidenced by Jenkins at 
> https://issues.apache.org/jira/browse/YARN-1041?focusedCommentId=13868621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13868621.
> It's failing randomly on trunk on my local box too 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: apache-yarn-90.2.patch

Fixed the issue that caused the patch application to fail.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).
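
For illustration only (this is not the attached apache-yarn-90.2.patch, and 
the class, field, and method names below are hypothetical, as is the simple 
health check), a minimal sketch of the behavior described above: periodically 
re-check previously failed local dirs and move the ones that pass a health 
check back into the usable set, so the NodeManager does not need a restart.

{code}
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Sketch only: restore previously failed local dirs once they become healthy again. */
public class DirsRecheckSketch {
  private final List<String> goodDirs = new ArrayList<String>();
  private final List<String> failedDirs = new ArrayList<String>();

  /** Intended to run on a timer, e.g. alongside the NM's periodic disk health check. */
  public synchronized void recheckFailedDirs() {
    for (Iterator<String> it = failedDirs.iterator(); it.hasNext(); ) {
      String dir = it.next();
      if (isHealthy(dir)) {
        // The disk came back: make it usable again without restarting the NM.
        it.remove();
        goodDirs.add(dir);
      }
    }
  }

  private boolean isHealthy(String dir) {
    File f = new File(dir);
    // Minimal check: the directory exists and is readable, writable and executable.
    return f.isDirectory() && f.canRead() && f.canWrite() && f.canExecute();
  }
}
{code}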



--
This message was sent by Atlassian JIRA
(v6.2#6252)