[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739303#comment-13739303
 ] 

Hitesh Shah commented on YARN-1055:
---

[~kkambatl] In the case of a network issue where the AM is running but cannot talk 
to the RM, or where, say, the NM on which the AM was running goes down, which knob 
would control handling these situations?
 

> Handle app recovery differently for AM failures and RM restart
> --
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-292) ResourceManager throws ArrayIndexOutOfBoundsException while handling CONTAINER_ALLOCATED for application attempt

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739304#comment-13739304
 ] 

Hadoop QA commented on YARN-292:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597896/YARN-292.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1710//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1710//console

This message is automatically generated.

> ResourceManager throws ArrayIndexOutOfBoundsException while handling 
> CONTAINER_ALLOCATED for application attempt
> 
>
> Key: YARN-292
> URL: https://issues.apache.org/jira/browse/YARN-292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.1-alpha
>Reporter: Devaraj K
>Assignee: Zhijie Shen
> Attachments: YARN-292.1.patch
>
>
> {code:xml}
> 2012-12-26 08:41:15,030 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
> Calling allocate on removed or non existant application 
> appattempt_1356385141279_49525_01
> 2012-12-26 08:41:15,031 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_ALLOCATED for applicationAttempt 
> application_1356385141279_49525
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>   at java.lang.Thread.run(Thread.java:662)
>  {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-292) ResourceManager throws ArrayIndexOutOfBoundsException while handling CONTAINER_ALLOCATED for application attempt

2013-08-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739302#comment-13739302
 ] 

Junping Du commented on YARN-292:
-

Thanks for the patch, Zhijie! The patch looks good to me. However, I would suggest 
documenting why at least one container is expected in the allocation, or adding a 
non-empty check on getContainers().
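A minimal, hypothetical sketch of the kind of guard being suggested; the helper name 
and the way it would be wired into the transition are assumptions, not the actual 
patch:

{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;

// Hypothetical helper (illustrative only): return the AM container only when
// the scheduler actually allocated one, so the caller can keep waiting for a
// later CONTAINER_ALLOCATED event instead of indexing into an empty list and
// hitting ArrayIndexOutOfBoundsException.
final class AllocationCheck {
  static Container firstAllocatedOrNull(List<Container> containers) {
    if (containers == null || containers.isEmpty()) {
      return null;
    }
    return containers.get(0);
  }
}
{code}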

> ResourceManager throws ArrayIndexOutOfBoundsException while handling 
> CONTAINER_ALLOCATED for application attempt
> 
>
> Key: YARN-292
> URL: https://issues.apache.org/jira/browse/YARN-292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.1-alpha
>Reporter: Devaraj K
>Assignee: Zhijie Shen
> Attachments: YARN-292.1.patch
>
>
> {code:xml}
> 2012-12-26 08:41:15,030 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
> Calling allocate on removed or non existant application 
> appattempt_1356385141279_49525_01
> 2012-12-26 08:41:15,031 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_ALLOCATED for applicationAttempt 
> application_1356385141279_49525
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>   at java.lang.Thread.run(Thread.java:662)
>  {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-292) ResourceManager throws ArrayIndexOutOfBoundsException while handling CONTAINER_ALLOCATED for application attempt

2013-08-13 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-292:
-

Attachment: YARN-292.1.patch

Created a patch to use ConcurrentHashMap for applications in FifoScheduler and 
FairScheduler, which will make accessing applications thread-safe.
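A minimal sketch of that kind of change, with placeholder type parameters since the 
real scheduler classes are not shown in this thread; only the choice of map 
implementation is the point:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only (placeholder types, not the actual patch): keeping the
// per-attempt application map in a ConcurrentHashMap lets the dispatcher
// thread's get() interleave safely with the scheduler thread's put()/remove().
class SchedulerApplicationsSketch<AttemptId, App> {
  private final Map<AttemptId, App> applications =
      new ConcurrentHashMap<AttemptId, App>();

  App get(AttemptId id) { return applications.get(id); }
  void add(AttemptId id, App app) { applications.put(id, app); }
  void remove(AttemptId id) { applications.remove(id); }
}
{code}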

> ResourceManager throws ArrayIndexOutOfBoundsException while handling 
> CONTAINER_ALLOCATED for application attempt
> 
>
> Key: YARN-292
> URL: https://issues.apache.org/jira/browse/YARN-292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.1-alpha
>Reporter: Devaraj K
>Assignee: Zhijie Shen
> Attachments: YARN-292.1.patch
>
>
> {code:xml}
> 2012-12-26 08:41:15,030 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
> Calling allocate on removed or non existant application 
> appattempt_1356385141279_49525_01
> 2012-12-26 08:41:15,031 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_ALLOCATED for applicationAttempt 
> application_1356385141279_49525
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>   at java.lang.Thread.run(Thread.java:662)
>  {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-451) Add more metrics to RM page

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739225#comment-13739225
 ] 

Vinod Kumar Vavilapalli commented on YARN-451:
--

Agreed about having it on the listing page, but that page is already dense. 
We have to do some basic UI design.

Again, like I mentioned, Hadoop-1 was different because the number of maps and 
reduces doesn't change after the job starts, whereas in Hadoop-2 the memory/cores 
allocated slowly increase over time, so it may or may not be of much use. I am 
ambivalent about adding it.

> Add more metrics to RM page
> ---
>
> Key: YARN-451
> URL: https://issues.apache.org/jira/browse/YARN-451
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Lohit Vijayarenu
>Priority: Minor
>
> The ResourceManager webUI shows the list of RUNNING applications, but it does not 
> tell which applications are requesting more resources compared to others. With a 
> cluster running hundreds of applications at once, it would be useful to have 
> some kind of metric to show high-resource-usage applications vs low-resource-usage 
> ones. At the minimum, showing the number of containers is a good option.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager

2013-08-13 Thread prophy Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739219#comment-13739219
 ] 

prophy Yan commented on YARN-993:
-

Jian He, I have tried the patch file in the YARN-513 list, but some errors occur 
when I use the patch. My test version is hadoop-2.0.5-alpha, so can this patch 
work with this version? Thank you.

> job can not recovery after restart resourcemanager
> --
>
> Key: YARN-993
> URL: https://issues.apache.org/jira/browse/YARN-993
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.5-alpha
> Environment: CentOS5.3 JDK1.7.0_11
>Reporter: prophy Yan
>Priority: Critical
>
> Recently, I have tested the job recovery function in the YARN framework, but it 
> failed.
> First, I ran the wordcount example program, and then I killed (kill -9) the 
> resourcemanager process on the server when the wordcount job was at map 100%.
> The job exited with an error within minutes.
> Second, I restarted the resourcemanager on the server using the 
> 'start-yarn.sh' command, but the failed job (wordcount) cannot continue.
> The YARN log says "file not exist!"
> Here is the YARN log:
> 2013-07-23 16:05:21,472 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
> launching container Container: [ContainerId: 
> container_1374564764970_0001_02_01, NodeId: mv8.mzhen.cn:52117, 
> NodeHttpAddress: mv8.mzhen.cn:8042, Resource: , 
> Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id 
> {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 
> 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_02
> 2013-07-23 16:05:21,473 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1374564764970_0001_02 State change from ALLOCATED to LAUNCHED
> 2013-07-23 16:05:21,925 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1374564764970_0001_02 State change from LAUNCHED to FAILED
> 2013-07-23 16:05:21,925 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1374564764970_0001 failed 1 times due to AM Container for 
> appattempt_1374564764970_0001_02 exited with  exitCode: -1000 due to: 
> RemoteTrace:
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
> at 
> org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
> at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
> at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
>  at LocalTrace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
> File does not exist: 
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> at 
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
> at 
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService

[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739218#comment-13739218
 ] 

Hudson commented on YARN-1060:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4256 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4256/])
YARN-1060. Two tests in TestFairScheduler are missing @Test annotation 
(Niranjan Singh via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1513724)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


> Two tests in TestFairScheduler are missing @Test annotation
> ---
>
> Key: YARN-1060
> URL: https://issues.apache.org/jira/browse/YARN-1060
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Niranjan Singh
>  Labels: newbie
> Fix For: 2.3.0
>
> Attachments: YARN-1060.patch
>
>
> Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739216#comment-13739216
 ] 

Sandy Ryza commented on YARN-1060:
--

Committed to trunk and branch-2.  Thanks Niranjan!

> Two tests in TestFairScheduler are missing @Test annotation
> ---
>
> Key: YARN-1060
> URL: https://issues.apache.org/jira/browse/YARN-1060
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Niranjan Singh
>  Labels: newbie
> Attachments: YARN-1060.patch
>
>
> Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-451) Add more metrics to RM page

2013-08-13 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739213#comment-13739213
 ] 

Sangjin Lee commented on YARN-451:
--

I think showing this information on the app list page is actually more valuable 
than the per-app page. If this information is present in the app list page, one 
can quickly scan the list and get a sense of which job/app is bigger than 
others in terms of resource consumption. Also, it makes sorting possible.

One could in theory visit individual per-app pages one by one to get the same 
information, but it's so much more useful to have it ready at the overview page 
so one can get that information quickly.

In hadoop 1.0, one could get the same information by looking at the total number of 
mappers and reducers. That gave a very good idea of which jobs are big (and thus 
need to be monitored more closely) without drilling into any of the apps.

> Add more metrics to RM page
> ---
>
> Key: YARN-451
> URL: https://issues.apache.org/jira/browse/YARN-451
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Lohit Vijayarenu
>Priority: Minor
>
> The ResourceManager webUI shows the list of RUNNING applications, but it does not 
> tell which applications are requesting more resources compared to others. With a 
> cluster running hundreds of applications at once, it would be useful to have 
> some kind of metric to show high-resource-usage applications vs low-resource-usage 
> ones. At the minimum, showing the number of containers is a good option.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739211#comment-13739211
 ] 

Sandy Ryza commented on YARN-1060:
--

+1

> Two tests in TestFairScheduler are missing @Test annotation
> ---
>
> Key: YARN-1060
> URL: https://issues.apache.org/jira/browse/YARN-1060
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Niranjan Singh
>  Labels: newbie
> Attachments: YARN-1060.patch
>
>
> Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739178#comment-13739178
 ] 

Rohith Sharma K S commented on YARN-1061:
-

The actual issue I hit was in a 5-node cluster (1 RM and 5 NMs). It is hard to 
reproduce the scenario where the resourcemanager is in a hung state in a real 
cluster.

The same scenario can be simulated by manually bringing the resourcemanager to a 
hung state with the help of the linux command "KILL -STOP ". All the NM->RM calls 
then wait indefinitely. Another case where we can observe an indefinite wait is 
adding a new NodeManager while the ResourceManager is in a hung state.



> NodeManager is indefinitely waiting for nodeHeartBeat() response from 
> ResouceManager.
> -
>
> Key: YARN-1061
> URL: https://issues.apache.org/jira/browse/YARN-1061
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
>Reporter: Rohith Sharma K S
>
> It is observed that in one of the scenarios, the NodeManager waits indefinitely 
> for the nodeHeartbeat response from the ResourceManager when the ResourceManager 
> is in a hung state.
> The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739109#comment-13739109
 ] 

Karthik Kambatla commented on YARN-1055:


From a YARN-user POV, I see it differently. I want to control whether my app 
should be recovered on AM/RM failures separately. I might want to recover on 
RM restart but not on AM failures, or vice versa:
# In case of AM failure, the user might want to check for user errors and hence not 
recover, but recover in case of RM failures.
# Like Oozie, one might want to recover on AM failures but not on RM failures.

Also, is there a disadvantage to having two knobs for the two failures?

> Handle app recovery differently for AM failures and RM restart
> --
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739073#comment-13739073
 ] 

Bikas Saha commented on YARN-1055:
--

Restart on AM failure is already determined by the default value of max AM 
retries in the yarn config. Setting that to 1 will prevent the RM from restarting 
AMs on failure, so there is no need for a new config. Restart after RM restart is 
already covered by the app client setting max AM retries to 1 on app submission. If 
an app cannot handle this situation, it should create its own config and set the 
correct value of 1 on submission. YARN should not add a config, IMO. If I 
remember right, this config was imported from Hadoop 1, and the implementation of 
this config in Hadoop 1 is what the RM already does to handle user-defined max AM 
retries.
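A minimal, hypothetical sketch of the per-application route described above, 
assuming a per-app max-attempts setter such as 
ApplicationSubmissionContext#setMaxAppAttempts is available to the submitting 
client; this is illustrative, not taken from this thread:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

// Illustrative only: an app (e.g. a launcher job) that cannot tolerate being
// re-run after an RM restart caps its own attempts at 1 when it submits,
// instead of YARN adding a separate global knob.
public final class SingleAttemptSubmission {
  static void configure(ApplicationSubmissionContext context) {
    // One attempt total: no AM retry on failure, no re-launch on RM restart.
    context.setMaxAppAttempts(1);
  }
}
{code}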

> Handle app recovery differently for AM failures and RM restart
> --
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739040#comment-13739040
 ] 

Vinod Kumar Vavilapalli commented on YARN-1055:
---

This is a completely new issue with Hadoop 2 - we've added new failure 
conditions. Having all apps handle AM restarts is really the right way forward, 
given that AMs can now run on random compute nodes that can fail at any time. 
Offline, I started engaging some of the Pig/Hive community folks. For MR, enough 
work is already done. Oozie needs to follow suit too.

Till work-preserving restart is finished, this is a real pain on RM restarts, 
which is why I am proposing that Oozie set max-attempts to 1 for its launcher 
action so that there are no split-brain issues - RM restart or otherwise. Oozie 
has a retry mechanism anyway, under which it will then be submitted as a new 
application.

Adding a separate knob just for restart is a hack I don't see any value in. If 
I read your proposal correctly, for launcher jobs you will set 
restart.am.on.rm.restart to 1 and restart.am.on.am.failure > 1. Right? That 
is not correct, as I repeated - node failures will cause the same split-brain 
issues.

> Handle app recovery differently for AM failures and RM restart
> --
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739027#comment-13739027
 ] 

Alejandro Abdelnur commented on YARN-1055:
--

[~vinodkv], in theory I agree with you. In practice, there are 2 issues Oozie 
cannot address in the short term:

* 1. Oozie is still using a launcher MRAM.
* 2. The mr/pig/hive/sqoop/distcp/... fat clients are not aware of Yarn 
restart/recovery.

#1 will be addressed when Oozie implements an OozieLauncherAM instead of 
piggybacking on an MR Map as the driver.
#2 is more complicated, and I don't see it being addressed in the 
short/medium term.

By having distinct knobs differentiating recovery after AM failure and after RM 
restart, Oozie can handle/recover jobs in the same set of failure scenarios 
possible with Hadoop 1. In order to get folks onto Yarn we need to provide 
functional parity.

I suggest having the 2 knobs Karthik proposed, {{restart.am.on.rm.restart}} and 
{{restart.am.on.am.failure}}, with 
{{restart.am.on.rm.restart=$restart.am.on.am.failure}}. 

Does this sound reasonable?

> Handle app recovery differently for AM failures and RM restart
> --
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-13 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739013#comment-13739013
 ] 

Bikas Saha commented on YARN-1058:
--

It could be that the history service was not properly shut down in the first AM. 
Earlier, the AM would receive a proper reboot command from the RM and would 
shut down properly based on the reboot flag being set. Now the AM is getting an 
exception from the RM and so is not shutting down properly. This should get fixed 
when we refresh the AM RM token from the saved value.

> Recovery issues on RM Restart with FileSystemRMStateStore
> -
>
> Key: YARN-1058
> URL: https://issues.apache.org/jira/browse/YARN-1058
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> App recovery doesn't work as expected using FileSystemRMStateStore.
> Steps to reproduce:
> - Ran sleep job with a single map and sleep time of 2 mins
> - Restarted RM while the map task is still running
> - The first attempt fails with the following error
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Password not found for ApplicationAttempt 
> appattempt_1376294441253_0001_01
>   at org.apache.hadoop.ipc.Client.call(Client.java:1404)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1357)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at $Proxy28.finishApplicationMaster(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
> {noformat}
> - The second attempt fails with a different error:
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist:
>  File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have 
> any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-292) ResourceManager throws ArrayIndexOutOfBoundsException while handling CONTAINER_ALLOCATED for application attempt

2013-08-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739011#comment-13739011
 ] 

Zhijie Shen commented on YARN-292:
--

Did more investigation on this issue:

{code}
2012-12-26 08:41:15,030 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
Calling allocate on removed or non existant application 
appattempt_1356385141279_49525_01
{code}
This log indicates that the ArrayIndexOutOfBoundsException happens because the 
application is not found. There are three possibilities for the application not 
being found:

1. The application hasn't been added to FifoScheduler#applications yet. If this is 
the case, FifoScheduler will not send the APP_ACCEPTED event to the corresponding 
RMAppAttemptImpl. Without the APP_ACCEPTED event, RMAppAttemptImpl will not enter 
the SCHEDULED state, and consequently will not go through 
AMContainerAllocatedTransition to ALLOCATED_SAVING. Therefore, this case is 
impossible.

2. The application has already been removed from FifoScheduler#applications. To 
trigger the removal operation, the corresponding RMAppAttemptImpl needs to go 
through BaseFinalTransition.

It is worth mentioning first that RMAppAttemptImpl's transitions are executed 
on the thread of AsyncDispatcher, while YarnScheduler#handle is invoked on the 
thread of SchedulerEventDispatcher. The two threads execute in parallel, 
which means the processing of an RMAppAttemptEvent and that of a 
SchedulerEvent may interleave; however, the processing of two 
RMAppAttemptEvents, or of two SchedulerEvents, will not.

Therefore, for the removal to have happened already, RMAppAttemptImpl must have 
finished BaseFinalTransition before AMContainerAllocatedTransition starts. But 
when RMAppAttemptImpl goes through BaseFinalTransition, it also enters a final 
state, so AMContainerAllocatedTransition will not happen at all. In 
conclusion, this case is impossible as well.

3. The application is in FifoScheduler#applications, but RMAppAttemptImpl 
doesn't get it. First of all, FifoScheduler#applications is a TreeMap, which is 
not thread-safe (FairScheduler#applications is a HashMap, while 
CapacityScheduler#applications is a ConcurrentHashMap). Second, the methods 
accessing the map are not consistently synchronized; thus, a read and a write on 
the same map can operate simultaneously. RMAppAttemptImpl, on the thread of 
AsyncDispatcher, will eventually call FifoScheduler#applications#get in 
AMContainerAllocatedTransition, while FifoScheduler, on the thread of 
SchedulerEventDispatcher, will call FifoScheduler#applications#add|remove. 
Therefore, getting null when the application actually exists can happen given a 
large number of concurrent operations.

Please feel free to correct me if you think there's something wrong or missing 
in the analysis. I'm going to work on a patch to fix the problem.
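A self-contained toy illustration of the race in possibility 3, making no 
assumptions about the actual scheduler classes: a reader thread repeatedly looks up 
a key that is never removed from an unsynchronized TreeMap while a writer thread 
adds and removes other keys, which can intermittently make the lookup miss (or 
throw), mirroring the "application not found" symptom:

{code}
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy reproduction (not YARN code). Key 0 is inserted once and never removed,
// yet because TreeMap is not thread-safe the concurrent reader can see null
// or an exception while the writer rebalances the tree. Switching the map to
// a ConcurrentHashMap makes the misses disappear.
public class TreeMapRaceDemo {
  public static void main(String[] args) throws InterruptedException {
    final Map<Integer, String> applications = new TreeMap<Integer, String>();
    applications.put(0, "appattempt_0");
    final AtomicBoolean stop = new AtomicBoolean(false);

    Thread writer = new Thread(new Runnable() {
      public void run() {
        try {
          for (int i = 1; i < 200000; i++) {
            applications.put(i, "appattempt_" + i);   // concurrent structural change
            applications.remove(i);
          }
        } finally {
          stop.set(true);                             // always let the reader finish
        }
      }
    });

    Thread reader = new Thread(new Runnable() {
      public void run() {
        int misses = 0;
        while (!stop.get()) {
          try {
            if (applications.get(0) == null) {        // key 0 always exists...
              misses++;                               // ...but the lookup can miss
            }
          } catch (RuntimeException e) {
            misses++;                                 // tree seen mid-rebalance
          }
        }
        System.out.println("failed lookups of an existing key: " + misses);
      }
    });

    writer.start();
    reader.start();
    writer.join();
    reader.join();
  }
}
{code}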

> ResourceManager throws ArrayIndexOutOfBoundsException while handling 
> CONTAINER_ALLOCATED for application attempt
> 
>
> Key: YARN-292
> URL: https://issues.apache.org/jira/browse/YARN-292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.1-alpha
>Reporter: Devaraj K
>Assignee: Zhijie Shen
>
> {code:xml}
> 2012-12-26 08:41:15,030 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
> Calling allocate on removed or non existant application 
> appattempt_1356385141279_49525_01
> 2012-12-26 08:41:15,031 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_ALLOCATED for applicationAttempt 
> application_1356385141279_49525
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
>   at 
> org.apache.hadoop.yarn.server.resour

[jira] [Commented] (YARN-1024) Define a virtual core unambigiously

2013-08-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738985#comment-13738985
 ] 

Sandy Ryza commented on YARN-1024:
--

I've been thinking a lot about this, and wanted to propose a modified approach, 
inspired by an offline discussion with Arun and his max-vcores idea 
(https://issues.apache.org/jira/browse/YARN-1024?focusedCommentId=13730074&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13730074).

First, my assumptions about how CPUs work:
* A CPU is essentially a bathtub full of processing power that can be doled out 
to threads, with a limit per thread based on the power of each core within it.
* To give X processing power to a thread means that within a standard unit of 
time, roughly some number of instructions proportional to X can be executed for 
that thread. 
* No more than a certain amount of processing power (the amount of processing 
power per core) can be given to each thread.
* We can use CGroups to say that a task gets some fraction of the system's 
processing power.
* This means that if we have 5 cores with Y processing power each, we can give 
5 threads Y processing power each, or 6 threads 5Y/6 processing power each, but 
we can't give 4 threads 5Y/4 processing power each.
* It never makes sense to use CGroups to assign a task a higher fraction of the 
system's processing power than (numthreads the task can take advantage of / number 
of cores).
* Equivalently, if my CPU has X processing power per core, it never makes sense 
to assign more than (numthreads the task can take advantage of) * X processing 
power to a task.

So as long as we account for that last constraint, we can essentially view 
processing power as a fluid resource like memory.  With this in mind, we can:
1. Split virtual cores into cores and yarnComputeUnitsPerCore.  Requests can 
include both and nodes can be configured with both.
2. Have a cluster-defined maxComputeUnitsPerCore, which would be the smallest 
yarnComputeUnitsPerCore on any node.  We min all yarnComputeUnitsPerCore 
requests with this number when they hit the RM.
3. Use YCUs, not cores, for scheduling.  I.e. the scheduler thinks of a node's 
CPU capacity in terms of the number of YCUs it can handle and thinks of a 
resource's CPU request in terms of its (normalized yarnComputeUnitsPerCore * # 
cores).  We use YCUs for DRF.
4. If we make YCUs small enough, no need for fractional anything.

This reduces to a number-of-cores-based approach if all containers are 
requested with yarnComputeUnitsPerCore=infinity, and reduces to a YCU approach 
if maxComputeUnitsPerCore is set to infinity.  Predictability, simplicity, and 
scheduling flexibility can be traded off per cluster without overloading the 
same concept with multiple definitions.
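A small worked illustration of steps 2 and 3 above; every number and name here is 
hypothetical, chosen only to show the normalization and accounting arithmetic:

{code}
// Hypothetical numbers only: a node with 8 cores at 100 YCUs each can schedule
// 800 YCUs. A request for 4 cores at 300 yarnComputeUnitsPerCore is first capped
// at the cluster-wide maxComputeUnitsPerCore (here 100), then charged in YCUs.
public class YcuNormalizationExample {
  public static void main(String[] args) {
    int nodeCores = 8, nodeYcuPerCore = 100;
    int nodeCapacityYcu = nodeCores * nodeYcuPerCore;             // 800 YCUs

    int requestedCores = 4, requestedYcuPerCore = 300;
    int maxComputeUnitsPerCore = 100;   // smallest per-core YCU value in the cluster
    int normalizedYcuPerCore = Math.min(requestedYcuPerCore, maxComputeUnitsPerCore);
    int requestYcu = requestedCores * normalizedYcuPerCore;       // 4 * 100 = 400

    System.out.println("node capacity: " + nodeCapacityYcu + " YCUs, "
        + "request charged: " + requestYcu + " YCUs");
  }
}
{code}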

This doesn't take into account heterogeneous hardware within a cluster, but I 
think (2) can be tweaked to handle this by holding a value for each node (I can 
elaborate on how this would work). It also doesn't take into account pinning 
threads to CPUs, but I don't think it's any less extensible for ultimately 
dealing with this than other proposals.

Sorry for the longwindedness.  Bobby, would this provide the flexibility you're 
looking for?

> Define a virtual core unambigiously
> ---
>
> Key: YARN-1024
> URL: https://issues.apache.org/jira/browse/YARN-1024
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
>
> We need to clearly define the meaning of a virtual core unambiguously so that 
> it's easy to migrate applications between clusters.
> For e.g. here is Amazon EC2 definition of ECU: 
> http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
> Essentially we need to clearly define a YARN Virtual Core (YVC).
> Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the 
> equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-337) RM handles killed application tracking URL poorly

2013-08-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738968#comment-13738968
 ] 

Thomas Graves commented on YARN-337:


+1 looks good. Thanks Jason!  Feel free to commit it.

> RM handles killed application tracking URL poorly
> -
>
> Key: YARN-337
> URL: https://issues.apache.org/jira/browse/YARN-337
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.2-alpha, 0.23.5
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>  Labels: usability
> Attachments: YARN-337.patch
>
>
> When the ResourceManager kills an application, it leaves the proxy URL 
> redirecting to the original tracking URL for the application even though the 
> ApplicationMaster is no longer there to service it.  It should redirect it 
> somewhere more useful, like the RM's web page for the application, where the 
> user can find that the application was killed and links to the AM logs.
> In addition, sometimes the AM during teardown from the kill can attempt to 
> unregister and provide an updated tracking URL, but unfortunately the RM has 
> "forgotten" the AM due to the kill and refuses to process the unregistration. 
>  Instead it logs:
> {noformat}
> 2013-01-09 17:37:49,671 [IPC Server handler 2 on 8030] ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> AppAttemptId doesnt exist in cache appattempt_1357575694478_28614_01
> {noformat}
> It should go ahead and process the unregistration to update the tracking URL 
> since the application offered it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738934#comment-13738934
 ] 

Karthik Kambatla commented on YARN-1058:


I was expecting the first one, and Bikas is right about the second one.

When I kill the job client, the job does finish successfully. However, the AM 
for the recovered attempt fails to write the history.
{noformat}
2013-08-13 13:57:32,440 ERROR [eventHandlingThread] 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[eventHandlingThread,5,main] threw an Exception.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/tmp/hadoop-yarn/staging/kasha/.staging/job_1376427059607_0002/job_1376427059607_0002_2.jhist:
 File does not exist. Holder DFSClient_NONMAPREDUCE_416024880_1 does not have 
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
...  
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2037)

at 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
at 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$1.run(JobHistoryEventHandler.java:276)
at java.lang.Thread.run(Thread.java:662)
{noformat}

> Recovery issues on RM Restart with FileSystemRMStateStore
> -
>
> Key: YARN-1058
> URL: https://issues.apache.org/jira/browse/YARN-1058
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> App recovery doesn't work as expected using FileSystemRMStateStore.
> Steps to reproduce:
> - Ran sleep job with a single map and sleep time of 2 mins
> - Restarted RM while the map task is still running
> - The first attempt fails with the following error
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Password not found for ApplicationAttempt 
> appattempt_1376294441253_0001_01
>   at org.apache.hadoop.ipc.Client.call(Client.java:1404)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1357)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at $Proxy28.finishApplicationMaster(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
> {noformat}
> - The second attempt fails with a different error:
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist:
>  File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have 
> any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-08-13 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-573:


Fix Version/s: 0.23.10

+1 lgtm as well, thanks Mit and Omkar!  I committed this to branch-0.23.

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Fix For: 3.0.0, 0.23.10, 2.1.1-beta
>
> Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch, 
> YARN-573.branch-0.23-08132013.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738851#comment-13738851
 ] 

Hadoop QA commented on YARN-1060:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597684/YARN-1060.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1709//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1709//console

This message is automatically generated.

> Two tests in TestFairScheduler are missing @Test annotation
> ---
>
> Key: YARN-1060
> URL: https://issues.apache.org/jira/browse/YARN-1060
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Niranjan Singh
>  Labels: newbie
> Attachments: YARN-1060.patch
>
>
> Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-08-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738845#comment-13738845
 ] 

Omkar Vinit Joshi commented on YARN-573:


+1 ..lgtm for branch 0.23

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Fix For: 3.0.0, 2.1.1-beta
>
> Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch, 
> YARN-573.branch-0.23-08132013.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-08-13 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-573:
---

Attachment: YARN-573.branch-0.23-08132013.patch

Patch ported to branch-0.23.

> Shared data structures in Public Localizer and Private Localizer are not 
> Thread safe.
> -
>
> Key: YARN-573
> URL: https://issues.apache.org/jira/browse/YARN-573
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>Priority: Critical
> Fix For: 3.0.0, 2.1.1-beta
>
> Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch, 
> YARN-573.branch-0.23-08132013.patch
>
>
> PublicLocalizer
> 1) pending accessed by addResource (part of event handling) and run method 
> (as a part of PublicLocalizer.run() ).
> PrivateLocalizer
> 1) pending accessed by addResource (part of event handling) and 
> findNextResource (i.remove()). Also update method should be fixed. It too is 
> sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738754#comment-13738754
 ] 

Jason Lowe commented on YARN-1036:
--

+1, LGTM as well. Committing this.

> Distributed Cache gives inconsistent result if cache files get deleted from 
> task tracker 
> -
>
> Key: YARN-1036
> URL: https://issues.apache.org/jira/browse/YARN-1036
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 0.23.9
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-1036.branch-0.23.patch, 
> YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch
>
>
> This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
> that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738686#comment-13738686
 ] 

Hadoop QA commented on YARN-1056:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597782/yarn-1056-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1708//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1708//console

This message is automatically generated.

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch, yarn-1056-2.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738651#comment-13738651
 ] 

Jian He commented on YARN-1056:
---

Looks good, +1

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch, yarn-1056-2.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-435) Make it easier to access cluster topology information in an AM

2013-08-13 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738647#comment-13738647
 ] 

shenhong commented on YARN-435:
---

Firstly, if the AM gets all nodes in the cluster, including their rack 
information, by calling the RM, this will increase the load on the RM's network; 
consider, for example, a cluster with more than 5000 datanodes.

Secondly, if the YARN cluster only has 100 nodemanagers but the HDFS it accesses 
is a cluster with more than 5000 datanodes, we can't get all the nodes and their 
rack information this way. The AM still needs all the datanode information from 
its job.splitmetainfo file in order to init the TaskAttempts, so in this case we 
can't get all the nodes by calling the RM.

> Make it easier to access cluster topology information in an AM
> --
>
> Key: YARN-435
> URL: https://issues.apache.org/jira/browse/YARN-435
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Omkar Vinit Joshi
>
> ClientRMProtocol exposes a getClusterNodes api that provides a report on all 
> nodes in the cluster including their rack information. 
> However, this requires the AM to open and establish a separate connection to 
> the RM in addition to one for the AMRMProtocol. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1056:
---

Attachment: yarn-1056-2.patch

Updated fs config to be fs.state-store.uri instead of fs.rm-state-store.uri
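For reference, the renamed key would be used along these lines (a sketch only;
the URI value is made up, and the final key name is whatever the committed patch
defines):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RMStateStoreConfExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Renamed key as proposed in this JIRA; the old name was
    // yarn.resourcemanager.fs.rm-state-store.uri. The URI is illustrative.
    conf.set("yarn.resourcemanager.fs.state-store.uri",
        "hdfs://namenode:8020/yarn/system/rmstore");
    System.out.println(conf.get("yarn.resourcemanager.fs.state-store.uri"));
  }
}
{code}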

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch, yarn-1056-2.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1030) Adding AHS as service of RM

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738622#comment-13738622
 ] 

Hadoop QA commented on YARN-1030:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597781/YARN-1030.2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1707//console

This message is automatically generated.

> Adding AHS as service of RM
> ---
>
> Key: YARN-1030
> URL: https://issues.apache.org/jira/browse/YARN-1030
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1030.1.patch, YARN-1030.2.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1030) Adding AHS as service of RM

2013-08-13 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1030:
--

Attachment: YARN-1030.2.patch

Thanks [~devaraj.k] for your review. I've updated the patch according to your 
comments. If YARN-953 is committed first, I'll remove the change to pom.xml from 
this patch; for now I'm keeping it so as not to break the build.

> Adding AHS as service of RM
> ---
>
> Key: YARN-1030
> URL: https://issues.apache.org/jira/browse/YARN-1030
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-1030.1.patch, YARN-1030.2.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-1062) MRAppMaster take a long time to init taskAttempt

2013-08-13 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong resolved YARN-1062.


Resolution: Duplicate

> MRAppMaster take a long time to init taskAttempt
> 
>
> Key: YARN-1062
> URL: https://issues.apache.org/jira/browse/YARN-1062
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 0.23.6
>Reporter: shenhong
>
> In our cluster, MRAppMaster take a long time to init taskAttempt, the 
> following log last one minute,
> 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
> /r03b05
> 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
> /r03b02
> 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> The reason is: resolved one host to rack almost take 25ms (We resolve the 
> host to rack by a python script). Our hdfs cluster is more than 4000 
> datanodes, then a large input job will take a long time to init TaskAttempt.
> Is there any good idea to solve this problem. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2013-08-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738601#comment-13738601
 ] 

Omkar Vinit Joshi commented on YARN-1061:
-

Are you able to reproduce this scenario? Can you please enable DEBUG 
(HADOOP_ROOT_LOGGER & YARN_ROOT_LOGGER) logs and attach them to this JIRA? How 
big is your cluster? What is the frequency at which the nodemanagers are 
heartbeating? Can you also attach yarn-site.xml? Which version are you using?

> NodeManager is indefinitely waiting for nodeHeartBeat() response from 
> ResouceManager.
> -
>
> Key: YARN-1061
> URL: https://issues.apache.org/jira/browse/YARN-1061
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
>Reporter: Rohith Sharma K S
>
> It is observed that in one scenario the NodeManager waits indefinitely for the 
> nodeHeartbeat response from the ResourceManager when the ResourceManager is in 
> a hung state.
> The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1062) MRAppMaster take a long time to init taskAttempt

2013-08-13 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738596#comment-13738596
 ] 

shenhong commented on YARN-1062:


Thanks Vinod Kumar Vavilapalli. YARN-435 works for me.


> MRAppMaster take a long time to init taskAttempt
> 
>
> Key: YARN-1062
> URL: https://issues.apache.org/jira/browse/YARN-1062
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 0.23.6
>Reporter: shenhong
>
> In our cluster, MRAppMaster take a long time to init taskAttempt, the 
> following log last one minute,
> 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
> /r03b05
> 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
> /r03b02
> 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> The reason is: resolved one host to rack almost take 25ms (We resolve the 
> host to rack by a python script). Our hdfs cluster is more than 4000 
> datanodes, then a large input job will take a long time to init TaskAttempt.
> Is there any good idea to solve this problem. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738595#comment-13738595
 ] 

Karthik Kambatla commented on YARN-1056:


[~jianhe], good point. Let me upload a patch including that change shortly.

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738582#comment-13738582
 ] 

Hadoop QA commented on YARN-1056:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597773/yarn-1056-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1706//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1706//console

This message is automatically generated.

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738572#comment-13738572
 ] 

Jian He commented on YARN-1056:
---

Hi [~kkambatl], do you think it's also necessary to change 
'yarn.resourcemanager.fs.rm-state-store.uri' to 
'yarn.resourcemanager.fs.state-store.uri' for consistency with 'zk.state-store'?

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738570#comment-13738570
 ] 

Hadoop QA commented on YARN-1021:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597774/YARN-1021.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-assemblies hadoop-tools/hadoop-sls hadoop-tools/hadoop-tools-dist.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1705//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1705//console

This message is automatically generated.

> Yarn Scheduler Load Simulator
> -
>
> Key: YARN-1021
> URL: https://issues.apache.org/jira/browse/YARN-1021
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
>
>
> The Yarn Scheduler is a fertile area of interest with different 
> implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
> several optimizations are also made to improve scheduler performance for 
> different scenarios and workload. Each scheduler algorithm has its own set of 
> features, and drives scheduling decisions by many factors, such as fairness, 
> capacity guarantee, resource availability, etc. It is very important to 
> evaluate a scheduler algorithm very well before we deploy it in a production 
> cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
> algorithm. Evaluating in a real cluster is always time and cost consuming, 
> and it is also very hard to find a large-enough cluster. Hence, a simulator 
> which can predict how well a scheduler algorithm for some specific workload 
> would be quite useful.
> We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
> clusters and application loads in a single machine. This would be invaluable 
> in furthering Yarn by providing a tool for researchers and developers to 
> prototype new scheduler features and predict their behavior and performance 
> with reasonable amount of confidence, there-by aiding rapid innovation.
> The simulator will exercise the real Yarn ResourceManager removing the 
> network factor by simulating NodeManagers and ApplicationMasters via handling 
> and dispatching NM/AMs heartbeat events from within the same JVM.
> To keep tracking of scheduler behavior and performance, a scheduler wrapper 
> will wrap the real scheduler.
> The simulator will produce real time metrics while executing, including:
> * Resource usages for whole cluster and each queue, which can be utilized to 
> configure cluster and queue's capacity.
> * The detailed application execution trace (recorded in relation to simulated 
> time), which can be analyzed to understand/validate the  scheduler behavior 
> (individual jobs turn around time, throughput, fairness, capacity guarantee, 
> etc).
> * Several key metrics of scheduler algorithm, such as time cost of each 
> scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
> developers to find the code spots and scalability limits.
> The simulator will provide real time charts showing the behavior of the 
> scheduler and its performance.
> A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
> how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator

2013-08-13 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738556#comment-13738556
 ] 

Wei Yan commented on YARN-1021:
---

Updates in this patch: reduce the number of threads needed for the NMSimulators. 
Previously, each NMSimulator used one thread (for its AsyncDispatcher). The 
AsyncDispatcher has now been removed, and the total number of threads needed 
depends only on the thread pool size.
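Roughly, the idea looks like the sketch below (simplified, not the actual
simulator code; class and variable names here are made up):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NMSimulatorPoolSketch {
  public static void main(String[] args) throws InterruptedException {
    // One shared pool drives every simulated NM, so the thread count is
    // bounded by the pool size rather than by the number of NMs.
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(8);
    int numSimulatedNMs = 100;
    for (int i = 0; i < numSimulatedNMs; i++) {
      final int nmId = i;
      pool.scheduleAtFixedRate(new Runnable() {
        public void run() {
          // Placeholder for building and sending this NM's heartbeat.
          System.out.println("heartbeat from simulated NM " + nmId);
        }
      }, 0, 1, TimeUnit.SECONDS);
    }
    Thread.sleep(3000);
    pool.shutdownNow();
  }
}
{code}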

> Yarn Scheduler Load Simulator
> -
>
> Key: YARN-1021
> URL: https://issues.apache.org/jira/browse/YARN-1021
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
>
>
> The Yarn Scheduler is a fertile area of interest with different 
> implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
> several optimizations are also made to improve scheduler performance for 
> different scenarios and workload. Each scheduler algorithm has its own set of 
> features, and drives scheduling decisions by many factors, such as fairness, 
> capacity guarantee, resource availability, etc. It is very important to 
> evaluate a scheduler algorithm very well before we deploy it in a production 
> cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
> algorithm. Evaluating in a real cluster is always time and cost consuming, 
> and it is also very hard to find a large-enough cluster. Hence, a simulator 
> which can predict how well a scheduler algorithm for some specific workload 
> would be quite useful.
> We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
> clusters and application loads in a single machine. This would be invaluable 
> in furthering Yarn by providing a tool for researchers and developers to 
> prototype new scheduler features and predict their behavior and performance 
> with reasonable amount of confidence, there-by aiding rapid innovation.
> The simulator will exercise the real Yarn ResourceManager removing the 
> network factor by simulating NodeManagers and ApplicationMasters via handling 
> and dispatching NM/AMs heartbeat events from within the same JVM.
> To keep tracking of scheduler behavior and performance, a scheduler wrapper 
> will wrap the real scheduler.
> The simulator will produce real time metrics while executing, including:
> * Resource usages for whole cluster and each queue, which can be utilized to 
> configure cluster and queue's capacity.
> * The detailed application execution trace (recorded in relation to simulated 
> time), which can be analyzed to understand/validate the  scheduler behavior 
> (individual jobs turn around time, throughput, fairness, capacity guarantee, 
> etc).
> * Several key metrics of scheduler algorithm, such as time cost of each 
> scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
> developers to find the code spots and scalability limits.
> The simulator will provide real time charts showing the behavior of the 
> scheduler and its performance.
> A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
> how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738560#comment-13738560
 ] 

Arun C Murthy commented on YARN-1056:
-

Looks fine, I'll commit after Jenkins. Thanks [~kkambatl].

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738541#comment-13738541
 ] 

Omkar Vinit Joshi commented on YARN-1036:
-

+1 ... thanks for updating the patch. LGTM.

> Distributed Cache gives inconsistent result if cache files get deleted from 
> task tracker 
> -
>
> Key: YARN-1036
> URL: https://issues.apache.org/jira/browse/YARN-1036
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 0.23.9
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-1036.branch-0.23.patch, 
> YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch
>
>
> This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
> that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator

2013-08-13 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1021:
--

Attachment: YARN-1021.patch

A new patch updates some code in the NMSimulator.

> Yarn Scheduler Load Simulator
> -
>
> Key: YARN-1021
> URL: https://issues.apache.org/jira/browse/YARN-1021
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
>
>
> The Yarn Scheduler is a fertile area of interest with different 
> implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
> several optimizations are also made to improve scheduler performance for 
> different scenarios and workload. Each scheduler algorithm has its own set of 
> features, and drives scheduling decisions by many factors, such as fairness, 
> capacity guarantee, resource availability, etc. It is very important to 
> evaluate a scheduler algorithm very well before we deploy it in a production 
> cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
> algorithm. Evaluating in a real cluster is always time and cost consuming, 
> and it is also very hard to find a large-enough cluster. Hence, a simulator 
> which can predict how well a scheduler algorithm for some specific workload 
> would be quite useful.
> We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
> clusters and application loads in a single machine. This would be invaluable 
> in furthering Yarn by providing a tool for researchers and developers to 
> prototype new scheduler features and predict their behavior and performance 
> with reasonable amount of confidence, there-by aiding rapid innovation.
> The simulator will exercise the real Yarn ResourceManager removing the 
> network factor by simulating NodeManagers and ApplicationMasters via handling 
> and dispatching NM/AMs heartbeat events from within the same JVM.
> To keep tracking of scheduler behavior and performance, a scheduler wrapper 
> will wrap the real scheduler.
> The simulator will produce real time metrics while executing, including:
> * Resource usages for whole cluster and each queue, which can be utilized to 
> configure cluster and queue's capacity.
> * The detailed application execution trace (recorded in relation to simulated 
> time), which can be analyzed to understand/validate the  scheduler behavior 
> (individual jobs turn around time, throughput, fairness, capacity guarantee, 
> etc).
> * Several key metrics of scheduler algorithm, such as time cost of each 
> scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
> developers to find the code spots and scalability limits.
> The simulator will provide real time charts showing the behavior of the 
> scheduler and its performance.
> A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
> how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1056:
---

Target Version/s: 2.1.0-beta  (was: 2.1.1-beta)

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1056:
---

Attachment: yarn-1056-1.patch

Reuploading patch to kick Jenkins.

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch, yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1056:
---

Priority: Blocker  (was: Major)

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
>  Labels: conf
> Attachments: yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-979) [YARN-321] Adding application attempt and container to ApplicationHistoryProtocol

2013-08-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738519#comment-13738519
 ] 

Zhijie Shen commented on YARN-979:
--

There are some high-level comments on the patch:

1. To make the protocol work, the corresponding protos need to be defined in 
yarn_service.proto, and application_history_service.proto needs to be updated.

2. The setters of the request/response APIs should be @Public, shouldn't they?

3. ApplicationHistoryProtocol needs to be marked as well.
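On point 2, the annotation pattern would look roughly like the sketch below (the
record and methods are placeholders, not the actual YARN-979 classes):

{code:java}
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Unstable;

// Placeholder record shape, not the actual YARN-979 request class.
@Public
@Unstable
public abstract class GetContainerReportRequestExample {

  @Public
  @Unstable
  public abstract String getContainerId();

  // Per point 2 above: the setter carries the @Public annotation as well,
  // instead of being left @Private.
  @Public
  @Unstable
  public abstract void setContainerId(String containerId);
}
{code}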

> [YARN-321] Adding application attempt and container to 
> ApplicationHistoryProtocol
> -
>
> Key: YARN-979
> URL: https://issues.apache.org/jira/browse/YARN-979
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-979-1.patch
>
>
>  Adding application attempt and container to ApplicationHistoryProtocol
> Thanks,
> Mayank

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738484#comment-13738484
 ] 

Hadoop QA commented on YARN-1036:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12597762/YARN-1036.branch-0.23.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1704//console

This message is automatically generated.

> Distributed Cache gives inconsistent result if cache files get deleted from 
> task tracker 
> -
>
> Key: YARN-1036
> URL: https://issues.apache.org/jira/browse/YARN-1036
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 0.23.9
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-1036.branch-0.23.patch, 
> YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch
>
>
> This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
> that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1036:
---

Attachment: YARN-1036.branch-0.23.patch

Thanks Jason and Omkar for your comments. OK, here is the updated patch, whose 
src/main code is exactly as Omkar suggested.
I've tested it by using a pen drive to simulate a drive failure, and the file is 
indeed localized again.

> Distributed Cache gives inconsistent result if cache files get deleted from 
> task tracker 
> -
>
> Key: YARN-1036
> URL: https://issues.apache.org/jira/browse/YARN-1036
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 0.23.9
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-1036.branch-0.23.patch, 
> YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch
>
>
> This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
> that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1023) [YARN-321] Webservices REST API's support for Application History

2013-08-13 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1023:
--

Summary: [YARN-321] Webservices REST API's support for Application History  
(was: [YARN-321] Weservices REST API's support for Application History)

> [YARN-321] Webservices REST API's support for Application History
> -
>
> Key: YARN-1023
> URL: https://issues.apache.org/jira/browse/YARN-1023
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-321
>Reporter: Devaraj K
>Assignee: Devaraj K
> Attachments: YARN-1023-v0.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1062) MRAppMaster take a long time to init taskAttempt

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738417#comment-13738417
 ] 

Vinod Kumar Vavilapalli commented on YARN-1062:
---

You should definitely see if you can improve your Python script by having it 
look up a static resolution file instead of dynamically pinging DNS at run time. 
That will clearly improve your performance.

Overall, we wish to expose this information to AMs from the RM itself so that 
each AM doesn't need to do this on its own. That is tracked via YARN-435. If 
that is okay with you, please close this as a duplicate.
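For example, a static resolution along these lines avoids a per-host script fork
or DNS round trip (the file format and class are made up for illustration; this
is not the existing topology-script plugin):

{code:java}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Loads "host rack" pairs once, then answers lookups from memory.
public class StaticRackResolver {
  private final Map<String, String> hostToRack = new HashMap<String, String>();

  public StaticRackResolver(String mappingFile) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(mappingFile));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 2) {
          hostToRack.put(parts[0], parts[1]);
        }
      }
    } finally {
      in.close();
    }
  }

  public String resolve(String host) {
    String rack = hostToRack.get(host);
    return rack != null ? rack : "/default-rack";
  }
}
{code}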

> MRAppMaster take a long time to init taskAttempt
> 
>
> Key: YARN-1062
> URL: https://issues.apache.org/jira/browse/YARN-1062
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 0.23.6
>Reporter: shenhong
>
> In our cluster, MRAppMaster take a long time to init taskAttempt, the 
> following log last one minute,
> 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
> /r03b05
> 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
> /r03b02
> 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> The reason is: resolved one host to rack almost take 25ms (We resolve the 
> host to rack by a python script). Our hdfs cluster is more than 4000 
> datanodes, then a large input job will take a long time to init TaskAttempt.
> Is there any good idea to solve this problem. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738413#comment-13738413
 ] 

Vinod Kumar Vavilapalli commented on YARN-1056:
---

Config changes ARE API changes. If you wish to rename it right now, mark this as 
a blocker and let the release manager know. Otherwise, you should deprecate this 
config, add a new one, and wait for the next release. I'm okay either way.

> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> 
>
> Key: YARN-1056
> URL: https://issues.apache.org/jira/browse/YARN-1056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: conf
> Attachments: yarn-1056-1.patch
>
>
> Fix configs 
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>  to have a *resourcemanager* only once, make them consistent with other such 
> yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1062) MRAppMaster take a long time to init taskAttempt

2013-08-13 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong updated YARN-1062:
---

Description: 
In our cluster, MRAppMaster take a long time to init taskAttempt, the following 
log last one minute,

2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
/r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED
2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
/r03b02
2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
/r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
/r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED

The reason is: resolved one host to rack almost take 25ms (We resolve the host 
to rack by a python script). Our hdfs cluster is more than 4000 datanodes, then 
a large input job will take a long time to init TaskAttempt.

Is there any good idea to solve this problem. 

  was:
In our cluster, MRAppMaster take a long time to init taskAttempt, the following 
log last one minute,

2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
/r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED

The reason is: resolved one host to rack almost take 25ms, our hdfs cluster is 
more than 4000 datanodes, then a large input job will take a long time to init 
TaskAttempt.

Is there any good idea to solve this problem.


> MRAppMaster take a long time to init taskAttempt
> 
>
> Key: YARN-1062
> URL: https://issues.apache.org/jira/browse/YARN-1062
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Affects Versions: 0.23.6
>Reporter: shenhong
>
> In our cluster, MRAppMaster take a long time to init taskAttempt, the 
> following log last one minute,
> 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
> /r01f11
> 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
> /r03b05
> 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
> /r03b02
> 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
> /r02f02
> 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
> attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
> UNASSIGNED
> The reason is: resolved one host to rack almost take 25ms (We resolve the 
> host to rack by a python script). Our hdfs cluster is more than 4000 
> datanodes, then a large input job will take a long time to init TaskAttempt.
> Is there any good idea to solve this 

[jira] [Created] (YARN-1062) MRAppMaster take a long time to init taskAttempt

2013-08-13 Thread shenhong (JIRA)
shenhong created YARN-1062:
--

 Summary: MRAppMaster take a long time to init taskAttempt
 Key: YARN-1062
 URL: https://issues.apache.org/jira/browse/YARN-1062
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 0.23.6
Reporter: shenhong


In our cluster, MRAppMaster take a long time to init taskAttempt, the following 
log last one minute,

2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
/r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED

The reason is: resolved one host to rack almost take 25ms, our hdfs cluster is 
more than 4000 datanodes, then a large input job will take a long time to init 
TaskAttempt.

Is there any good idea to solve this problem.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738337#comment-13738337
 ] 

Jason Lowe commented on YARN-1036:
--

Agree with Ravi that we should focus on porting the change to 0.23 and fix any 
issues that also apply to trunk/branch-2 in a separate JIRA.  Therefore I agree 
with Omkar that we should simply break or omit the LOCALIZED case from the 
switch statement since 0.23 doesn't have localCacheDirectoryManager to match 
the trunk behavior.  Otherwise patch looks good to me.

> Distributed Cache gives inconsistent result if cache files get deleted from 
> task tracker 
> -
>
> Key: YARN-1036
> URL: https://issues.apache.org/jira/browse/YARN-1036
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 0.23.9
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch
>
>
> This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
> that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1008) MiniYARNCluster with multiple nodemanagers, all nodes have same key for allocations

2013-08-13 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738175#comment-13738175
 ] 

Alejandro Abdelnur commented on YARN-1008:
--

What the patch does:

Introduces a new configuration property in the RM, 
{{RM_SCHEDULER_USE_PORT_FOR_NODE_NAME}} (defaulting to {{false}}). 

This is an RM property, but it is to be used by the scheduler implementations 
when matching a ResourceRequest to a node.

If the property is set to {{false}}, things work as they do today: the matching 
is done using only the HOSTNAME obtained from the NodeId.

If the property is set to {{true}}, the matching is done using the 
HOSTNAME:PORT obtained from the NodeId.

There are no changes on the NM or AM side.

If the property is set to {{true}}, the AM must be aware of that setting and, 
when creating resource requests, it must set the location to the HOSTNAME:PORT 
of the node instead of just the HOSTNAME.

The renaming of {{SchedulerNode#getHostName()}} to 
{{SchedulerNode#getNodeName()}} is to make it obvious to developers that the 
value may not be the HOSTNAME. Added javadocs explaining this clearly.

This works with all 3 schedulers.

The main motivation for this change is to be able to use the YARN minicluster 
with multiple NMs and to target a specific NM instance.

We could expose this for production use if there is a need. For that we would 
need to:

* Expose via ApplicationReport the node matching mode: HOSTNAME or HOSTNAME:PORT
* Provide a mechanism for AMs only aware of HOSTNAME matching mode to work with 
HOSTNAME:PORT mode

If there is a use case for this in a real deployment, we should follow this up 
with another JIRA.
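
As a rough sketch of the matching-key selection described above (illustrative 
only, not the actual patch; the class below is hypothetical, though 
{{NodeId#getHost()}} and {{NodeId#getPort()}} are existing YARN API):

{code}
import org.apache.hadoop.yarn.api.records.NodeId;

/** Sketch only: derive the scheduler-side matching key from a NodeId,
 *  depending on the new flag. */
public class NodeNameSketch {
    public static String nodeName(NodeId nodeId, boolean usePortForNodeName) {
        // HOSTNAME:PORT when the flag is on; plain HOSTNAME (today's behavior) otherwise.
        return usePortForNodeName
            ? nodeId.getHost() + ":" + nodeId.getPort()
            : nodeId.getHost();
    }
}
{code}

An AM running against a MiniYARNCluster with the flag enabled would then 
presumably put that HOSTNAME:PORT string into its resource requests to target a 
specific NM instance.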


> MiniYARNCluster with multiple nodemanagers, all nodes have same key for 
> allocations
> ---
>
> Key: YARN-1008
> URL: https://issues.apache.org/jira/browse/YARN-1008
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.1.0-beta
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
> Attachments: YARN-1008.patch, YARN-1008.patch, YARN-1008.patch
>
>
> While the NMs are keyed using the NodeId, the allocation is done based on the 
> hostname. 
> This makes the different nodes indistinguishable to the scheduler.
> There should be an option to enable host:port instead of just the hostname for 
> allocations. The nodes reported to the AM should report the 'key' (host or 
> host:port). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1059) IllegalArgumentException while starting YARN

2013-08-13 Thread rvller (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738002#comment-13738002
 ] 

rvller commented on YARN-1059:
--

It was my fault: one of the other parameters in the XML was still spread over 
multiple lines, which is why the RM was not able to start.

When I changed all of the parameters to one line 
(10.245.1.30:9030, etc.), the RM was able to start.

I suppose that this is a bug, though, because it's confusing.
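
For illustration (a sketch added here, not part of the original report): the 
failure comes from the whitespace that a pretty-printed, multi-line value 
element leaves around the configured address, which 
{{NetUtils.createSocketAddr}} rejects, as the stack trace in the quoted 
description shows. Keeping the value on one line (or trimming it) avoids the 
problem.

{code}
import org.apache.hadoop.net.NetUtils;

/** Sketch only: contrast a value read from a multi-line <value> element
 *  with the same value kept on a single line. */
public class HostPortWhitespaceSketch {
    public static void main(String[] args) {
        String multiLineValue = "\n10.245.1.30:9030\n";  // what a pretty-printed <value> yields
        String singleLineValue = multiLineValue.trim();   // "10.245.1.30:9030"

        // Parses fine: a clean host:port authority.
        System.out.println(NetUtils.createSocketAddr(singleLineValue));

        // Throws IllegalArgumentException: "Does not contain a valid host:port authority".
        System.out.println(NetUtils.createSocketAddr(multiLineValue));
    }
}
{code}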

> IllegalArgumentException while starting YARN
> 
>
> Key: YARN-1059
> URL: https://issues.apache.org/jira/browse/YARN-1059
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.5-alpha
> Environment: Ubuntu 12.04, hadoop 2.0.5
>Reporter: rvller
>
> Here is the traceback from starting the YARN ResourceManager:
> 2013-08-12 12:53:29,319 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting 
> ResourceManager
> java.lang.IllegalArgumentException: Does not contain a valid host:port 
> authority: 
> 10.245.1.30:9030
>  (configuration property 'yarn.resourcemanager.resource-tracker.address')
>   at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:193)
>   at 
> org.apache.hadoop.conf.Configuration.getSocketAddr(Configuration.java:1450)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.init(ResourceTrackerService.java:105)
>   at 
> org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.init(ResourceManager.java:255)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:710)
> And here is the yarn-site.xml:
> <configuration>
> 
> <property>
> <name>
> yarn.resourcemanager.address
> </name>
> <value>
> 10.245.1.30:9010
> </value>
> </property>
> 
> <property>
> <name>
> yarn.resourcemanager.scheduler.address
> </name>
> <value>
> 10.245.1.30:9020
> </value>
> </property>
> 
> <property>
> <name>
> yarn.resourcemanager.resource-tracker.address
> </name>
> <value>
> 10.245.1.30:9030
> </value>
> </property>
> 
> <property>
> <name>
> yarn.resourcemanager.admin.address
> </name>
> <value>
> 10.245.1.30:9040
> </value>
> </property>
> 
> <property>
> <name>
> yarn.resourcemanager.webapp.address
> </name>
> <value>
> 10.245.1.30:9050
> </value>
> </property>
> 
> </configuration>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737990#comment-13737990
 ] 

Rohith Sharma K S commented on YARN-1061:
-

The thread dump extracted from the NodeManager is:

{noformat}
"Node Status Updater" prio=10 tid=0x414dc000 nid=0x1d754 in 
Object.wait() [0x7fefa2dec000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Client.call(Client.java:1231)
- locked <0xdef4f158> (a org.apache.hadoop.ipc.Client$Call)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at $Proxy28.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at $Proxy30.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:348)
{noformat}
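
For illustration only (not the actual Hadoop IPC code): the unbounded 
{{Object.wait()}} in the dump above is the call that needs a bound. A timed 
wait of roughly this shape would let the heartbeat thread fail with a timeout 
instead of hanging forever.

{code}
import java.net.SocketTimeoutException;

/** Sketch only: a bounded wait in place of the unbounded Object.wait()
 *  seen in the NodeManager thread dump. */
public class TimedHeartbeatCall {
    private boolean done;   // in the real client this would be set by the IPC reader thread

    public synchronized void markDone() {
        done = true;
        notifyAll();
    }

    public synchronized void awaitResponse(long timeoutMs)
            throws InterruptedException, SocketTimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!done) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                throw new SocketTimeoutException(
                        "no nodeHeartbeat response within " + timeoutMs + " ms");
            }
            wait(remaining);   // bounded, unlike Object.wait() with no timeout
        }
    }
}
{code}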

> NodeManager is indefinitely waiting for nodeHeartBeat() response from 
> ResourceManager.
> -
>
> Key: YARN-1061
> URL: https://issues.apache.org/jira/browse/YARN-1061
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
>Reporter: Rohith Sharma K S
>
> It is observed that in one scenario the NodeManager waits indefinitely for the 
> nodeHeartbeat response from the ResourceManager when the ResourceManager is in 
> a hung state.
> The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-1061:
---

 Summary: NodeManager is indefinitely waiting for nodeHeartBeat() 
response from ResourceManager.
 Key: YARN-1061
 URL: https://issues.apache.org/jira/browse/YARN-1061
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith Sharma K S


It is observed that in one scenario the NodeManager waits indefinitely for the 
nodeHeartbeat response from the ResourceManager when the ResourceManager is in 
a hung state.

The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Niranjan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranjan Singh updated YARN-1060:
-

Attachment: YARN-1060.patch

Added @Test annotations
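
For illustration (the class and test name below are hypothetical, not the 
actual TestFairScheduler code): the fix is simply to annotate the affected 
methods so JUnit actually runs them.

{code}
import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class MissingAnnotationSketch {
    @Test   // previously missing, so the method was never executed as a test
    public void testPreviouslySkippedBehaviour() {
        assertTrue(true);   // stand-in for the existing test body
    }
}
{code}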

> Two tests in TestFairScheduler are missing @Test annotation
> ---
>
> Key: YARN-1060
> URL: https://issues.apache.org/jira/browse/YARN-1060
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Niranjan Singh
>  Labels: newbie
> Attachments: YARN-1060.patch
>
>
> Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Niranjan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranjan Singh reassigned YARN-1060:


Assignee: Niranjan Singh

> Two tests in TestFairScheduler are missing @Test annotation
> ---
>
> Key: YARN-1060
> URL: https://issues.apache.org/jira/browse/YARN-1060
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.1.0-beta
>Reporter: Sandy Ryza
>Assignee: Niranjan Singh
>  Labels: newbie
>
> Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira