[jira] [Commented] (YARN-4026) FiCaSchedulerApp: ContainerAllocator should be able to choose how to order pending resource requests

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693023#comment-14693023
 ] 

Hadoop QA commented on YARN-4026:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 21s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 44s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 41s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 49s | The applied patch generated  3 
new checkstyle issues (total was 128, now 128). |
| {color:red}-1{color} | whitespace |   0m  5s | The patch has 30  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 29s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  53m 24s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  91m 53s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12750004/YARN-4026.3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 3ae716f |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8829/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8829/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8829/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8829/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8829/console |


This message was automatically generated.

> FiCaSchedulerApp: ContainerAllocator should be able to choose how to order 
> pending resource requests
> 
>
> Key: YARN-4026
> URL: https://issues.apache.org/jira/browse/YARN-4026
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4026.1.patch, YARN-4026.2.patch, YARN-4026.3.patch
>
>
> After YARN-3983, we have an extensible ContainerAllocator which can be used 
> by FiCaSchedulerApp to decide how to allocate resources.
> While working on YARN-1651 (allocating resources to increase a container), I 
> found one aspect of the existing logic that is not flexible enough:
> - ContainerAllocator decides what to allocate for a given node and priority. 
> To support different kinds of resource allocation (for example, treating 
> priority as a weight, or choosing whether to skip a priority), it is better 
> to let ContainerAllocator choose how to order pending resource requests.
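As an illustration of the ordering hook being proposed, here is a minimal sketch 
(class and method names are hypothetical, not the API in the YARN-4026 patches): 
the allocator exposes a Comparator over pending requests, so a subclass can pick 
strict priority, priority-as-weight, or any other order.

{code}
// Hypothetical sketch of the ordering hook; not the actual
// FiCaSchedulerApp/ContainerAllocator code from the YARN-4026 patches.
import java.util.Comparator;
import java.util.List;

abstract class OrderingContainerAllocator {

  /** Subclasses decide how pending requests are ordered. */
  protected abstract Comparator<PendingRequest> requestOrder();

  /** Walk the pending requests in the chosen order and try each one. */
  void allocate(List<PendingRequest> pending) {
    pending.sort(requestOrder());
    for (PendingRequest req : pending) {
      // try to allocate req on the current node; skip it on failure
    }
  }

  /** Minimal stand-in for a pending resource request. */
  static final class PendingRequest {
    final int priority;
    PendingRequest(int priority) { this.priority = priority; }
  }

  /** Example ordering: strict priority (lower value = more important). */
  static Comparator<PendingRequest> strictPriority() {
    return Comparator.comparingInt(r -> r.priority);
  }
}
{code}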



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692916#comment-14692916
 ] 

Sangjin Lee commented on YARN-2859:
---

[~zjshen], can this be done for 2.6.1, or are you OK with deferring it to 2.6.2?

> ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
> --
>
> Key: YARN-2859
> URL: https://issues.apache.org/jira/browse/YARN-2859
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Hitesh Shah
>Assignee: Zhijie Shen
>Priority: Critical
>  Labels: 2.6.1-candidate
>
> In the mini cluster, a random port should be used. 
> Also, the config is not updated with the host and port that the process 
> actually got bound to.
> {code}
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster 
> (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer 
> address: localhost:10200
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster 
> (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer 
> web address: 0.0.0.0:8188
> {code}
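A minimal sketch of the behavior being requested (illustrative only, not the 
actual MiniYARNCluster patch): bind to an ephemeral port and write the address 
the server really bound to back into the configuration, instead of leaving the 
default 0.0.0.0:8188 in place.

{code}
// Illustrative sketch: bind to port 0 and publish the real address back into
// the Configuration so tests never collide on the default port.
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import org.apache.hadoop.conf.Configuration;

final class EphemeralPortExample {
  static void bindAndPublish(Configuration conf) throws Exception {
    try (ServerSocket socket = new ServerSocket(0)) {   // 0 = ephemeral port
      InetSocketAddress bound =
          (InetSocketAddress) socket.getLocalSocketAddress();
      // Publish the address we actually got, not the configured default.
      conf.set("yarn.timeline-service.webapp.address",
          bound.getHostName() + ":" + bound.getPort());
    }
  }
}
{code}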



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2746) YARNDelegationTokenID misses serializing version from the common abstract ID

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2746:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> YARNDelegationTokenID misses serializing version from the common abstract ID
> 
>
> Key: YARN-2746
> URL: https://issues.apache.org/jira/browse/YARN-2746
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jian He
>
> I found this during review of YARN-2743.
> bq. AbstractDTId had a version, we dropped that in the protobuf 
> serialization. We should just write it during the serialization and read it 
> back?
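A minimal, generic sketch of the suggestion (plain Writable-style code, not the 
actual YARNDelegationTokenIdentifier change): write the version first during 
serialization and read it back before the payload when deserializing.

{code}
// Generic sketch of serializing a version field; not the real
// YARNDelegationTokenIdentifier implementation.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

class VersionedIdExample {
  private static final byte VERSION = 0;
  private long sequenceNumber;          // example payload field

  public void write(DataOutput out) throws IOException {
    out.writeByte(VERSION);             // serialize the version first
    out.writeLong(sequenceNumber);
  }

  public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();       // read it back before the payload
    if (version != VERSION) {
      throw new IOException("Unknown version " + version);
    }
    sequenceNumber = in.readLong();
  }
}
{code}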



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2657) MiniYARNCluster to (optionally) add MicroZookeeper service

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2657:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> MiniYARNCluster to (optionally) add MicroZookeeper service
> --
>
> Key: YARN-2657
> URL: https://issues.apache.org/jira/browse/YARN-2657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2567-001.patch, YARN-2657-002.patch
>
>
> This is needed for testing things like YARN-2646: add an option for the 
> {{MiniYarnCluster}} to start a {{MicroZookeeperService}}.
> This is just another YARN service whose lifecycle the cluster creates and 
> tracks. The {{MicroZookeeperService}} publishes its binding information for 
> direct takeup by the registry services; this can address in-VM race conditions.
> The default setting for this service is "off".
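A rough sketch of the "optional service" pattern described above (the config key 
and placeholder service are invented for illustration; this is not the YARN-2657 
patch): the composite cluster adds the ZooKeeper service only when a test opts in.

{code}
// Hypothetical sketch of conditionally adding a child service to a
// CompositeService; the config key is made up for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.service.CompositeService;

class OptionalZkCluster extends CompositeService {
  OptionalZkCluster() { super("OptionalZkCluster"); }

  /** Placeholder standing in for the real MicroZookeeperService. */
  static class PlaceholderZkService extends AbstractService {
    PlaceholderZkService() { super("PlaceholderZkService"); }
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Default is "off"; a test must enable it explicitly.
    if (conf.getBoolean("miniyarn.test.zookeeper.enabled", false)) {
      addService(new PlaceholderZkService());
    }
    super.serviceInit(conf);
  }
}
{code}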



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2599) Standby RM should also expose some jmx and metrics

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2599:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2506) TimelineClient should NOT be in yarn-common project

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2506:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> TimelineClient should NOT be in yarn-common project
> ---
>
> Key: YARN-2506
> URL: https://issues.apache.org/jira/browse/YARN-2506
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
>Priority: Critical
>
> YARN-2298 incorrectly moved TimelineClient to the yarn-common project. It 
> doesn't belong there; we should move it back to the yarn-client module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2037) Add restart support for Unmanaged AMs

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2037:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Add restart support for Unmanaged AMs
> -
>
> Key: YARN-2037
> URL: https://issues.apache.org/jira/browse/YARN-2037
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> It would be nice to allow Unmanaged AMs also to restart in a work-preserving 
> way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2457) FairScheduler: Handle preemption to help starved parent queues

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2457:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> FairScheduler: Handle preemption to help starved parent queues
> --
>
> Key: YARN-2457
> URL: https://issues.apache.org/jira/browse/YARN-2457
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> YARN-2395/YARN-2394 add preemption timeout and threshold per queue, but don't 
> check for parent queue starvation. 
> We need to check that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2055:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> --
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
>
> If Queue A does not have enough capacity to run the AM, the AM will borrow 
> capacity from Queue B. In that case the AM will be killed when Queue B 
> reclaims its capacity, then launched and killed again, and the job will 
> eventually fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1848) Persist ClusterMetrics across RM HA transitions

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1848:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Persist ClusterMetrics across RM HA transitions
> ---
>
> Key: YARN-1848
> URL: https://issues.apache.org/jira/browse/YARN-1848
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Post YARN-1705, ClusterMetrics are reset on transition to standby. This is 
> acceptable, as the metrics show statistics since an RM became active. 
> Users might want to see metrics since the cluster was first started.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2014:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
> 
>
> Key: YARN-2014
> URL: https://issues.apache.org/jira/browse/YARN-2014
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: patrick white
>Assignee: Jason Lowe
>
> Performance comparison benchmarks of 2.x against 0.23 show that the AM 
> scalability benchmark's runtime is approximately 10% slower in 2.4.0. The 
> trend is consistent across later releases in both lines; the latest release 
> numbers are:
> 2.4.0.0 runtime 255.6 seconds (avg 5 passes)
> 0.23.9.12 runtime 230.4 seconds (avg 5 passes)
> Diff: -9.9% 
> AM Scalability test is essentially a sleep job that measures time to launch 
> and complete a large number of mappers.
> The diff is consistent and has been reproduced in both a larger (350 node, 
> 100,000 mappers) perf environment, as well as a small (10 node, 2,900 
> mappers) demo cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1856) cgroups based memory monitoring for containers

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1856:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> cgroups based memory monitoring for containers
> --
>
> Key: YARN-1856
> URL: https://issues.apache.org/jira/browse/YARN-1856
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Varun Vasudev
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1480:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> RM web services getApps() accepts many more filters than ApplicationCLI 
> "list" command
> --
>
> Key: YARN-1480
> URL: https://issues.apache.org/jira/browse/YARN-1480
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Kenji Kikushima
> Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, 
> YARN-1480-5.patch, YARN-1480-6.patch, YARN-1480.patch
>
>
> Nowadays the RM web services getApps() accepts many more filters than the 
> ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, 
> ideally, different interfaces should provide consistent functionality. Would 
> it be better to allow more filters in ApplicationCLI?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1767) Windows: Allow a way for users to augment classpath of YARN daemons

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1767:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Windows: Allow a way for users to augment classpath of YARN daemons
> ---
>
> Key: YARN-1767
> URL: https://issues.apache.org/jira/browse/YARN-1767
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>
> YARN-1429 adds a way to augment the classpath for *nix-based systems. Need 
> something similar for Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1681) When "banned.users" is not set in LCE's container-executor.cfg, submit job with user in DEFAULT_BANNED_USERS will receive unclear error message

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1681:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> When "banned.users" is not set in LCE's container-executor.cfg, submit job 
> with user in DEFAULT_BANNED_USERS will receive unclear error message
> ---
>
> Key: YARN-1681
> URL: https://issues.apache.org/jira/browse/YARN-1681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.2.0
>Reporter: Zhichun Wu
>Assignee: Zhichun Wu
>  Labels: container, usability
> Attachments: YARN-1681.patch
>
>
> When using the LCE in a secure setup, if "banned.users" is not set in 
> container-executor.cfg, submitting a job as a user in DEFAULT_BANNED_USERS 
> ("mapred", "hdfs", "bin", 0) produces an unclear error message.
> For example, if we use hdfs to submit an MR job, we may see the following on 
> the YARN app overview page:
> {code}
> appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: 
> Application application_1391353981633_0003 initialization failed 
> (exitCode=139) with output: 
> {code}
> while the preferred error message would look like:
> {code}
> appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: 
> Application application_1391353981633_0003 initialization failed 
> (exitCode=139) with output: Requested user hdfs is banned 
> {code}
> Just a minor bug, and I would like to start contributing to hadoop-common 
> with it :)
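For context, a hedged example of what an explicit entry in container-executor.cfg 
could look like (values are illustrative; only the key names come from the 
standard LCE configuration and the description above). Setting "banned.users" 
explicitly makes the policy visible instead of relying on the hard-coded 
DEFAULT_BANNED_USERS list.

{code}
# Illustrative container-executor.cfg snippet (example values only)
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000
{code}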



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3999) RM hangs on draining events

2015-08-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692901#comment-14692901
 ] 

Hudson commented on YARN-3999:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8286 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8286/])
YARN-3999. RM hangs on draining events. Contributed by Jian He (xgong: rev 
3ae716fa696b87e849dae40225dc59fb5ed114cb)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/TestAsyncDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMActiveServiceContext.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java


> RM hangs on draining events
> -
>
> Key: YARN-3999
> URL: https://issues.apache.org/jira/browse/YARN-3999
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.7.2
>
> Attachments: YARN-3999-branch-2.7.patch, YARN-3999.1.patch, 
> YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch, YARN-3999.4.patch, 
> YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch
>
>
> If external systems like ATS or ZK become very slow, draining all the 
> events takes a long time. If this time exceeds 10 minutes, all 
> applications will expire. Fixes include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move the ATS service out of the RM active services so that the RM doesn't 
> need to wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that the RM is stopping/transitioning.
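A minimal, generic sketch of fix 1 above (not the actual AsyncDispatcher change 
committed for YARN-3999): on stop, wait for the event queue to drain, but give 
up after a timeout so a slow downstream system cannot block shutdown forever.

{code}
// Generic "drain with timeout" sketch; not the real AsyncDispatcher code.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class DrainOnStopExample {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;

  void stop(long drainTimeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + drainTimeoutMs;
    // Wait for the queue to drain, but never longer than the timeout.
    while (!eventQueue.isEmpty() && System.currentTimeMillis() < deadline) {
      Thread.sleep(100);
    }
    stopped = true;  // stop dispatching even if some events are still queued
  }
}
{code}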



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692843#comment-14692843
 ] 

Hadoop QA commented on YARN-313:


\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  19m 19s | Findbugs (version 3.0.0) 
appears to be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 4 new or modified test files. |
| {color:green}+1{color} | javac |   7m 55s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  3s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 56s | The applied patch generated  4 
new checkstyle issues (total was 229, now 232). |
| {color:green}+1{color} | whitespace |   0m  6s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 37s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   5m 39s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 27s | Tests passed in 
hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests |   6m 58s | Tests failed in 
hadoop-yarn-client. |
| {color:red}-1{color} | yarn tests |   2m  0s | Tests failed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |  53m 35s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 111m  5s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.client.cli.TestRMAdminCLI |
|   | hadoop.yarn.client.api.impl.TestYarnClient |
|   | hadoop.yarn.util.TestRackResolver |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12749993/YARN-313-v7.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 3ae716f |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8828/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8828/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8828/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8828/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8828/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8828/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8828/console |


This message was automatically generated.

> Add Admin API for supporting node resource configuration in command line
> 
>
> Key: YARN-313
> URL: https://issues.apache.org/jira/browse/YARN-313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
> YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch, 
> YARN-313-v6.patch, YARN-313-v7.patch
>
>
> We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", 
> to support changes to a node's resources specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4026) FiCaSchedulerApp: ContainerAllocator should be able to choose how to order pending resource requests

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692827#comment-14692827
 ] 

Hadoop QA commented on YARN-4026:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 16s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 18s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 17s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 54s | The applied patch generated  3 
new checkstyle issues (total was 128, now 128). |
| {color:red}-1{color} | whitespace |   0m  5s | The patch has 30  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 36s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 34s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  53m 47s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  94m 41s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMAdminService |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12750004/YARN-4026.3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 7c796fd |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8827/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8827/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8827/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8827/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8827/console |


This message was automatically generated.

> FiCaSchedulerApp: ContainerAllocator should be able to choose how to order 
> pending resource requests
> 
>
> Key: YARN-4026
> URL: https://issues.apache.org/jira/browse/YARN-4026
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4026.1.patch, YARN-4026.2.patch, YARN-4026.3.patch
>
>
> After YARN-3983, we have an extensible ContainerAllocator which can be used 
> by FiCaSchedulerApp to decide how to allocate resources.
> While working on YARN-1651 (allocating resources to increase a container), I 
> found one aspect of the existing logic that is not flexible enough:
> - ContainerAllocator decides what to allocate for a given node and priority. 
> To support different kinds of resource allocation (for example, treating 
> priority as a weight, or choosing whether to skip a priority), it is better 
> to let ContainerAllocator choose how to order pending resource requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2038) Revisit how AMs learn of containers from previous attempts

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2038:
--
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Revisit how AMs learn of containers from previous attempts
> --
>
> Key: YARN-2038
> URL: https://issues.apache.org/jira/browse/YARN-2038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Based on YARN-556, we need to update the way AMs learn about containers 
> allocated in previous attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-313:
-
Target Version/s: 2.7.2, 2.6.2  (was: 2.6.1, 2.7.2)

> Add Admin API for supporting node resource configuration in command line
> 
>
> Key: YARN-313
> URL: https://issues.apache.org/jira/browse/YARN-313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
> YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch, 
> YARN-313-v6.patch, YARN-313-v7.patch
>
>
> We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", 
> to support changes to a node's resources specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3999) RM hangs on draining events

2015-08-11 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692729#comment-14692729
 ] 

Xuan Gong commented on YARN-3999:
-

Thanks, Jian. Committed into trunk/branch-2/branch-2.7.

> RM hangs on draining events
> -
>
> Key: YARN-3999
> URL: https://issues.apache.org/jira/browse/YARN-3999
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.7.2
>
> Attachments: YARN-3999-branch-2.7.patch, YARN-3999.1.patch, 
> YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch, YARN-3999.4.patch, 
> YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch
>
>
> If external systems like ATS or ZK become very slow, draining all the 
> events takes a long time. If this time exceeds 10 minutes, all 
> applications will expire. Fixes include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move the ATS service out of the RM active services so that the RM doesn't 
> need to wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that the RM is stopping/transitioning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2038) Revisit how AMs learn of containers from previous attempts

2015-08-11 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692709#comment-14692709
 ] 

sandflee commented on YARN-2038:


I thought it was the same issue as YARN-3519, but it seems not. I'm also 
confused about what the purpose of this issue is now.

> Revisit how AMs learn of containers from previous attempts
> --
>
> Key: YARN-2038
> URL: https://issues.apache.org/jira/browse/YARN-2038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Based on YARN-556, we need to update the way AMs learn about containers 
> allocated in previous attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3999) RM hangs on draining events

2015-08-11 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3999:
--
Attachment: YARN-3999-branch-2.7.patch

Uploaded the branch-2.7 patch.

> RM hangs on draining events
> -
>
> Key: YARN-3999
> URL: https://issues.apache.org/jira/browse/YARN-3999
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-3999-branch-2.7.patch, YARN-3999.1.patch, 
> YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch, YARN-3999.4.patch, 
> YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch
>
>
> If external systems like ATS or ZK become very slow, draining all the 
> events takes a long time. If this time exceeds 10 minutes, all 
> applications will expire. Fixes include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move the ATS service out of the RM active services so that the RM doesn't 
> need to wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that the RM is stopping/transitioning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4026) FiCaSchedulerApp: ContainerAllocator should be able to choose how to order pending resource requests

2015-08-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4026:
-
Attachment: YARN-4026.3.patch

Attached ver.3, added more comments and fixed findbugs warning.

> FiCaSchedulerApp: ContainerAllocator should be able to choose how to order 
> pending resource requests
> 
>
> Key: YARN-4026
> URL: https://issues.apache.org/jira/browse/YARN-4026
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4026.1.patch, YARN-4026.2.patch, YARN-4026.3.patch
>
>
> After YARN-3983, we have an extensible ContainerAllocator which can be used 
> by FiCaSchedulerApp to decide how to allocate resources.
> While working on YARN-1651 (allocating resources to increase a container), I 
> found one aspect of the existing logic that is not flexible enough:
> - ContainerAllocator decides what to allocate for a given node and priority. 
> To support different kinds of resource allocation (for example, treating 
> priority as a weight, or choosing whether to skip a priority), it is better 
> to let ContainerAllocator choose how to order pending resource requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4046:

Attachment: YARN-4046.002.patch

Fixed whitespace 

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4046.002.patch, YARN-4046.002.patch, 
> YARN-4096.001.patch
>
>
> On a Debian machine we have seen NodeManager recovery of containers fail 
> because the signal syntax for a process group may not work. We see errors 
> when checking whether the process is alive during container recovery, which 
> causes the container to be declared LOST (154) on a NodeManager restart.
> The application then fails with an error, and the attempts are not retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}
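For context, a hedged sketch of the kind of process-group liveness probe involved 
(illustrative only, not the actual NodeManager recovery code): send signal 0 to 
the process group and treat a non-zero exit code as "not alive". Passing "--" 
before the negative PID keeps stricter kill implementations from parsing it as 
an option, which is the sort of syntax difference described above.

{code}
// Illustrative Java sketch of a process-group liveness check; not the
// NodeManager container-recovery implementation.
import java.io.IOException;

final class ProcessGroupAlive {
  /** Returns true if signalling the process group with signal 0 succeeds. */
  static boolean isAlive(long pgrpId) throws IOException, InterruptedException {
    // "--" ends option parsing so the negative (process-group) PID is not
    // mistaken for a flag.
    Process p = new ProcessBuilder("kill", "-0", "--", "-" + pgrpId).start();
    return p.waitFor() == 0;
  }
}
{code}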



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4046:

Attachment: YARN-4046.002.patch

Fixed a new checkstyle issue that was added; the other two are preexisting and 
should not be fixed.

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4046.002.patch, YARN-4096.001.patch
>
>
> On a Debian machine we have seen NodeManager recovery of containers fail 
> because the signal syntax for a process group may not work. We see errors 
> when checking whether the process is alive during container recovery, which 
> causes the container to be declared LOST (154) on a NodeManager restart.
> The application then fails with an error, and the attempts are not retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4026) FiCaSchedulerApp: ContainerAllocator should be able to choose how to order pending resource requests

2015-08-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4026:
-
Attachment: YARN-4026.2.patch

Thanks for comments [~jianhe], attached ver.2 patch.

> FiCaSchedulerApp: ContainerAllocator should be able to choose how to order 
> pending resource requests
> 
>
> Key: YARN-4026
> URL: https://issues.apache.org/jira/browse/YARN-4026
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4026.1.patch, YARN-4026.2.patch
>
>
> After YARN-3983, we have an extensible ContainerAllocator which can be used 
> by FiCaSchedulerApp to decide how to allocate resources.
> While working on YARN-1651 (allocating resources to increase a container), I 
> found one aspect of the existing logic that is not flexible enough:
> - ContainerAllocator decides what to allocate for a given node and priority. 
> To support different kinds of resource allocation (for example, treating 
> priority as a weight, or choosing whether to skip a priority), it is better 
> to let ContainerAllocator choose how to order pending resource requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-08-11 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692632#comment-14692632
 ] 

Inigo Goiri commented on YARN-313:
--

Not critical; I think it can be deferred.
I would appreciate ideas on why this change breaks refreshNodes with a 
graceful period.

> Add Admin API for supporting node resource configuration in command line
> 
>
> Key: YARN-313
> URL: https://issues.apache.org/jira/browse/YARN-313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
> YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch, 
> YARN-313-v6.patch, YARN-313-v7.patch
>
>
> We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", 
> to support changes to a node's resources specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2457) FairScheduler: Handle preemption to help starved parent queues

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2457:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> FairScheduler: Handle preemption to help starved parent queues
> --
>
> Key: YARN-2457
> URL: https://issues.apache.org/jira/browse/YARN-2457
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> YARN-2395/YARN-2394 add preemption timeout and threshold per queue, but don't 
> check for parent queue starvation. 
> We need to check that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-313:
-

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Add Admin API for supporting node resource configuration in command line
> 
>
> Key: YARN-313
> URL: https://issues.apache.org/jira/browse/YARN-313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
> YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch, 
> YARN-313-v6.patch, YARN-313-v7.patch
>
>
> We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", 
> to support changes to a node's resources specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2038) Revisit how AMs learn of containers from previous attempts

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2038:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Revisit how AMs learn of containers from previous attempts
> --
>
> Key: YARN-2038
> URL: https://issues.apache.org/jira/browse/YARN-2038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Based on YARN-556, we need to update the way AMs learn about containers 
> allocated in previous attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1681) When "banned.users" is not set in LCE's container-executor.cfg, submit job with user in DEFAULT_BANNED_USERS will receive unclear error message

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1681:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> When "banned.users" is not set in LCE's container-executor.cfg, submit job 
> with user in DEFAULT_BANNED_USERS will receive unclear error message
> ---
>
> Key: YARN-1681
> URL: https://issues.apache.org/jira/browse/YARN-1681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.2.0
>Reporter: Zhichun Wu
>Assignee: Zhichun Wu
>  Labels: container, usability
> Attachments: YARN-1681.patch
>
>
> When using the LCE in a secure setup, if "banned.users" is not set in 
> container-executor.cfg, submitting a job as a user in DEFAULT_BANNED_USERS 
> ("mapred", "hdfs", "bin", 0) produces an unclear error message.
> For example, if we use hdfs to submit an MR job, we may see the following on 
> the YARN app overview page:
> {code}
> appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: 
> Application application_1391353981633_0003 initialization failed 
> (exitCode=139) with output: 
> {code}
> while the preferred error message would look like:
> {code}
> appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: 
> Application application_1391353981633_0003 initialization failed 
> (exitCode=139) with output: Requested user hdfs is banned 
> {code}
> Just a minor bug, and I would like to start contributing to hadoop-common 
> with it :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2076) Minor error in TestLeafQueue files

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2076:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Minor error in TestLeafQueue files
> --
>
> Key: YARN-2076
> URL: https://issues.apache.org/jira/browse/YARN-2076
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Chen He
>Assignee: Chen He
>Priority: Minor
>  Labels: test
> Attachments: YARN-2076.patch
>
>
> "numNodes" should be 2 instead of 3 in testReservationExchange() since only 
> two nodes are defined.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1767) Windows: Allow a way for users to augment classpath of YARN daemons

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1767:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Windows: Allow a way for users to augment classpath of YARN daemons
> ---
>
> Key: YARN-1767
> URL: https://issues.apache.org/jira/browse/YARN-1767
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>
> YARN-1429 adds a way to augment the classpath for *nix-based systems. Need 
> something similar for Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2055:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> --
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
>
> If Queue A does not have enough capacity to run the AM, the AM will borrow 
> capacity from Queue B. In that case the AM will be killed when Queue B 
> reclaims its capacity, then launched and killed again, and the job will 
> eventually fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1848) Persist ClusterMetrics across RM HA transitions

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1848:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Persist ClusterMetrics across RM HA transitions
> ---
>
> Key: YARN-1848
> URL: https://issues.apache.org/jira/browse/YARN-1848
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Post YARN-1705, ClusterMetrics are reset on transition to standby. This is 
> acceptable, as the metrics show statistics since an RM became active. 
> Users might want to see metrics since the cluster was first started.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1480:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> RM web services getApps() accepts many more filters than ApplicationCLI 
> "list" command
> --
>
> Key: YARN-1480
> URL: https://issues.apache.org/jira/browse/YARN-1480
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Kenji Kikushima
> Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, 
> YARN-1480-5.patch, YARN-1480-6.patch, YARN-1480.patch
>
>
> Nowadays the RM web services getApps() accepts many more filters than the 
> ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, 
> ideally, different interfaces should provide consistent functionality. Would 
> it be better to allow more filters in ApplicationCLI?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2506) TimelineClient should NOT be in yarn-common project

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2506:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> TimelineClient should NOT be in yarn-common project
> ---
>
> Key: YARN-2506
> URL: https://issues.apache.org/jira/browse/YARN-2506
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
>Priority: Critical
>
> YARN-2298 incorrectly moved TimelineClient to the yarn-common project. It 
> doesn't belong there; we should move it back to the yarn-client module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2599) Standby RM should also expose some jmx and metrics

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2599:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3478) FairScheduler page not rendered because of different enums YarnApplicationState and RMAppState

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-3478:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> FairScheduler page not rendered because of different enums 
> YarnApplicationState and RMAppState 
> ---
>
> Key: YARN-3478
> URL: https://issues.apache.org/jira/browse/YARN-3478
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.0
>Reporter: Xu Chen
> Attachments: YARN-3478.1.patch, YARN-3478.2.patch, YARN-3478.3.patch, 
> screenshot-1.png
>
>
> Got the following exception from the log:
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> at 
> com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1225)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.DynamicUserWebFilter$DynamicUserFilter.doFilter(DynamicUserWebFilter.java:59)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(Sele

[jira] [Updated] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2859:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
> --
>
> Key: YARN-2859
> URL: https://issues.apache.org/jira/browse/YARN-2859
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Hitesh Shah
>Assignee: Zhijie Shen
>Priority: Critical
>  Labels: 2.6.1-candidate
>
> In mini cluster, a random port should be used. 
> Also, the config is not updated to the host that the process got bound to.
> {code}
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster 
> (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer 
> address: localhost:10200
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster 
> (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer 
> web address: 0.0.0.0:8188
> {code}
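
For context, a minimal sketch of the usual test-cluster approach: ask the OS for an
ephemeral port and write the actual binding back into the configuration. The property
name below is illustrative only, not necessarily the one MiniYARNCluster uses.

{code}
import java.io.IOException;
import java.net.ServerSocket;
import org.apache.hadoop.conf.Configuration;

public class EphemeralPortExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Ask the OS for a free port instead of hard-coding 8188.
    try (ServerSocket probe = new ServerSocket(0)) {
      int port = probe.getLocalPort();
      // Publish the address the server will actually use so tests can read it
      // back from the config ("timeline.webapp.address" is a placeholder name).
      conf.set("timeline.webapp.address", "localhost:" + port);
    }
    // Note: there is a small window between closing the probe socket and the
    // real server binding; real code should bind with port 0 directly and then
    // update the config from the bound address.
  }
}
{code}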



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2746) YARNDelegationTokenID misses serializing version from the common abstract ID

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2746:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> YARNDelegationTokenID misses serializing version from the common abstract ID
> 
>
> Key: YARN-2746
> URL: https://issues.apache.org/jira/browse/YARN-2746
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jian He
>
> I found this during review of YARN-2743.
> bq. AbstractDTId had a version, we dropped that in the protobuf 
> serialization. We should just write it during the serialization and read it 
> back?
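
As an illustration only (not the actual YARNDelegationTokenIdentifier code), the
general Writable-style pattern for carrying a version field through serialization
looks like this; the field names here are hypothetical.

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Minimal sketch: write a version byte first, read it back, and fail fast on
// unknown versions.
class VersionedId {
  private static final byte VERSION = 1;
  private long sequenceNumber;

  public void write(DataOutput out) throws IOException {
    out.writeByte(VERSION);          // serialize the version explicitly
    out.writeLong(sequenceNumber);
  }

  public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();    // read it back before the payload
    if (version != VERSION) {
      throw new IOException("Unknown token id version " + version);
    }
    sequenceNumber = in.readLong();
  }
}
{code}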



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2037) Add restart support for Unmanaged AMs

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2037:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Add restart support for Unmanaged AMs
> -
>
> Key: YARN-2037
> URL: https://issues.apache.org/jira/browse/YARN-2037
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> It would be nice to allow Unmanaged AMs also to restart in a work-preserving 
> way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1856) cgroups based memory monitoring for containers

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-1856:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> cgroups based memory monitoring for containers
> --
>
> Key: YARN-1856
> URL: https://issues.apache.org/jira/browse/YARN-1856
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Varun Vasudev
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2657) MiniYARNCluster to (optionally) add MicroZookeeper service

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2657:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> MiniYARNCluster to (optionally) add MicroZookeeper service
> --
>
> Key: YARN-2657
> URL: https://issues.apache.org/jira/browse/YARN-2657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2567-001.patch, YARN-2657-002.patch
>
>
> This is needed for testing things like YARN-2646: add an option for the 
> {{MiniYarnCluster}} to start a {{MicroZookeeperService}}.
> This is just another YARN service to create and track the lifecycle. The 
> {{MicroZookeeperService}} publishes its binding information for direct takeup 
> by the registry services...this can address in-VM race conditions.
> The default setting for this service is "off"
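
A rough sketch of the "optional child service" pattern being proposed; the config
key and the MicroZookeeperService wiring are assumptions here, not the committed code.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.CompositeService;

// Sketch: a composite service that only adds the ZK child service when a
// boolean flag is set; the flag name is hypothetical.
class MiniClusterSketch extends CompositeService {
  static final String ENABLE_ZK = "miniyarn.zookeeper.enabled";

  MiniClusterSketch() {
    super("MiniClusterSketch");
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    if (conf.getBoolean(ENABLE_ZK, false)) {
      // addService(new MicroZookeeperService(...)) would go here; the child
      // service would publish its binding info once started.
    }
    super.serviceInit(conf);
  }
}
{code}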



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-2014:
--

Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it 
to 2.6.2. Let me know if you have comments. Thanks!

> Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
> 
>
> Key: YARN-2014
> URL: https://issues.apache.org/jira/browse/YARN-2014
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: patrick white
>Assignee: Jason Lowe
>
> Performance comparison benchmarks of 2.x against 0.23 show the AM scalability 
> benchmark's runtime is approximately 10% slower in 2.4.0. The trend is 
> consistent across later releases in both lines, latest release numbers are:
> 2.4.0.0 runtime 255.6 seconds (avg 5 passes)
> 0.23.9.12 runtime 230.4 seconds (avg 5 passes)
> Diff: -9.9% 
> AM Scalability test is essentially a sleep job that measures time to launch 
> and complete a large number of mappers.
> The diff is consistent and has been reproduced in both a larger (350 node, 
> 100,000 mappers) perf environment, as well as a small (10 node, 2,900 
> mappers) demo cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-08-11 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-313:
-
Attachment: YARN-313-v7.patch

Updated to trunk. It looks like it still breaks the unit test for the graceful 
refresh but I cannot figure out why.

> Add Admin API for supporting node resource configuration in command line
> 
>
> Key: YARN-313
> URL: https://issues.apache.org/jira/browse/YARN-313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
> YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch, 
> YARN-313-v6.patch, YARN-313-v7.patch
>
>
> We should provide some admin interface, e.g. "yarn rmadmin -refreshResources" 
> to support changes of node's resource specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2076) Minor error in TestLeafQueue files

2015-08-11 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692605#comment-14692605
 ] 

Chen He commented on YARN-2076:
---

I will update patch.

> Minor error in TestLeafQueue files
> --
>
> Key: YARN-2076
> URL: https://issues.apache.org/jira/browse/YARN-2076
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Chen He
>Assignee: Chen He
>Priority: Minor
>  Labels: test
> Attachments: YARN-2076.patch
>
>
> "numNodes" should be 2 instead of 3 in testReservationExchange() since only 
> two nodes are defined.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3978) Configurably turn off the saving of container info in Generic AHS

2015-08-11 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated YARN-3978:
--
Labels: 2.6.1-candidate  (was: )

> Configurably turn off the saving of container info in Generic AHS
> -
>
> Key: YARN-3978
> URL: https://issues.apache.org/jira/browse/YARN-3978
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver, yarn
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>  Labels: 2.6.1-candidate
> Fix For: 3.0.0, 2.8.0, 2.7.2
>
> Attachments: YARN-3978.001.patch, YARN-3978.002.patch, 
> YARN-3978.003.patch, YARN-3978.004.patch
>
>
> Depending on how each application's metadata is stored, one week's worth of 
> data stored in the Generic Application History Server's database can grow to 
> be almost a terabyte of local disk space. In order to alleviate this, I 
> suggest that there is a need for a configuration option to turn off saving of 
> non-AM container metadata in the GAHS data store.
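
A hedged sketch of what such an option could look like on the writer side; the
property name, ContainerRecord type and isAmContainer() check are placeholders,
not the attached patch.

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch only: skip persisting non-AM container metadata when the flag is off.
class ContainerHistorySketch {
  static final String SAVE_NON_AM_CONTAINER_META =
      "yarn.timeline-service.save-non-am-container-meta-info";  // placeholder key

  private final boolean saveNonAmContainers;

  ContainerHistorySketch(Configuration conf) {
    this.saveNonAmContainers = conf.getBoolean(SAVE_NON_AM_CONTAINER_META, true);
  }

  void onContainerFinished(ContainerRecord container) {
    if (!saveNonAmContainers && !container.isAmContainer()) {
      return; // drop non-AM container metadata to keep the store small
    }
    // writing the record to the history store would happen here
  }

  interface ContainerRecord {
    boolean isAmContainer();
  }
}
{code}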



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692546#comment-14692546
 ] 

Sangjin Lee commented on YARN-313:
--

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Add Admin API for supporting node resource configuration in command line
> 
>
> Key: YARN-313
> URL: https://issues.apache.org/jira/browse/YARN-313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
> YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch, 
> YARN-313-v6.patch
>
>
> We should provide some admin interface, e.g. "yarn rmadmin -refreshResources" 
> to support changes of node's resource specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692538#comment-14692538
 ] 

Sangjin Lee commented on YARN-1856:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> cgroups based memory monitoring for containers
> --
>
> Key: YARN-1856
> URL: https://issues.apache.org/jira/browse/YARN-1856
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Varun Vasudev
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692545#comment-14692545
 ] 

Sangjin Lee commented on YARN-1480:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> RM web services getApps() accepts many more filters than ApplicationCLI 
> "list" command
> --
>
> Key: YARN-1480
> URL: https://issues.apache.org/jira/browse/YARN-1480
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Kenji Kikushima
> Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, 
> YARN-1480-5.patch, YARN-1480-6.patch, YARN-1480.patch
>
>
> Nowadays RM web services getApps() accepts many more filters than 
> ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, 
> ideally, different interfaces should provide consistent functionality. Is it 
> better to allow more filters in ApplicationCLI?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1681) When "banned.users" is not set in LCE's container-executor.cfg, submit job with user in DEFAULT_BANNED_USERS will receive unclear error message

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692544#comment-14692544
 ] 

Sangjin Lee commented on YARN-1681:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> When "banned.users" is not set in LCE's container-executor.cfg, submit job 
> with user in DEFAULT_BANNED_USERS will receive unclear error message
> ---
>
> Key: YARN-1681
> URL: https://issues.apache.org/jira/browse/YARN-1681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.2.0
>Reporter: Zhichun Wu
>Assignee: Zhichun Wu
>  Labels: container, usability
> Attachments: YARN-1681.patch
>
>
> When using LCE in a secure setup, if "banned.users" is not set in 
> container-executor.cfg, submit job with user in DEFAULT_BANNED_USERS 
> ("mapred", "hdfs", "bin", 0)  will receive unclear error message.
> for example, if we use hdfs to submit an MR job, we may see the following on 
> the YARN app overview page:
> {code}
> appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: 
> Application application_1391353981633_0003 initialization failed 
> (exitCode=139) with output: 
> {code}
> while the preferred error message would look like:
> {code}
> appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: 
> Application application_1391353981633_0003 initialization failed 
> (exitCode=139) with output: Requested user hdfs is banned 
> {code}
> just a minor bug and I would like to start contributing to hadoop-common with 
> it:)
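
To make the failure mode concrete, here is a small sketch of the kind of check
involved, modeled in Java for readability; the real check lives in the native
container-executor, and the names below are illustrative.

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: when banned.users is not configured, a built-in default list applies,
// and the error string is what should end up in the diagnostics the user sees.
class BannedUserCheck {
  private static final Set<String> DEFAULT_BANNED_USERS =
      new HashSet<>(Arrays.asList("mapred", "hdfs", "bin"));

  static void checkUser(String user, Set<String> configuredBanned) {
    Set<String> banned =
        configuredBanned != null ? configuredBanned : DEFAULT_BANNED_USERS;
    if (banned.contains(user)) {
      // Surfacing this message is the point of the JIRA: without it the user
      // only sees an opaque "initialization failed (exitCode=139)".
      throw new IllegalArgumentException("Requested user " + user + " is banned");
    }
  }
}
{code}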



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1767) Windows: Allow a way for users to augment classpath of YARN daemons

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692543#comment-14692543
 ] 

Sangjin Lee commented on YARN-1767:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Windows: Allow a way for users to augment classpath of YARN daemons
> ---
>
> Key: YARN-1767
> URL: https://issues.apache.org/jira/browse/YARN-1767
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>
> YARN-1429 adds a way to augment the classpath for *nix-based systems. Need 
> something similar for Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692537#comment-14692537
 ] 

Sangjin Lee commented on YARN-2014:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
> 
>
> Key: YARN-2014
> URL: https://issues.apache.org/jira/browse/YARN-2014
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: patrick white
>Assignee: Jason Lowe
>
> Performance comparison benchmarks of 2.x against 0.23 show the AM scalability 
> benchmark's runtime is approximately 10% slower in 2.4.0. The trend is 
> consistent across later releases in both lines, latest release numbers are:
> 2.4.0.0 runtime 255.6 seconds (avg 5 passes)
> 0.23.9.12 runtime 230.4 seconds (avg 5 passes)
> Diff: -9.9% 
> AM Scalability test is essentially a sleep job that measures time to launch 
> and complete a large number of mappers.
> The diff is consistent and has been reproduced in both a larger (350 node, 
> 100,000 mappers) perf environment, as well as a small (10 node, 2,900 
> mappers) demo cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1848) Persist ClusterMetrics across RM HA transitions

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692541#comment-14692541
 ] 

Sangjin Lee commented on YARN-1848:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Persist ClusterMetrics across RM HA transitions
> ---
>
> Key: YARN-1848
> URL: https://issues.apache.org/jira/browse/YARN-1848
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Post YARN-1705, ClusterMetrics are reset on transition to standby. This is 
> acceptable as the metrics show statistics since an RM has become active. 
> Users might want to see metrics since the cluster was ever started.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2038) Revisit how AMs learn of containers from previous attempts

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692533#comment-14692533
 ] 

Sangjin Lee commented on YARN-2038:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Revisit how AMs learn of containers from previous attempts
> --
>
> Key: YARN-2038
> URL: https://issues.apache.org/jira/browse/YARN-2038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Based on YARN-556, we need to update the way AMs learn about containers 
> allocation previous attempts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692532#comment-14692532
 ] 

Sangjin Lee commented on YARN-2055:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> --
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
>
> If queue A does not have enough capacity to run the AM, the AM will borrow 
> capacity from queue B. When queue B reclaims its capacity, the AM is killed, 
> then launched and killed again, and in that case the job fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2037) Add restart support for Unmanaged AMs

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692534#comment-14692534
 ] 

Sangjin Lee commented on YARN-2037:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Add restart support for Unmanaged AMs
> -
>
> Key: YARN-2037
> URL: https://issues.apache.org/jira/browse/YARN-2037
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> It would be nice to allow Unmanaged AMs also to restart in a work-preserving 
> way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2076) Minor error in TestLeafQueue files

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692529#comment-14692529
 ] 

Sangjin Lee commented on YARN-2076:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Minor error in TestLeafQueue files
> --
>
> Key: YARN-2076
> URL: https://issues.apache.org/jira/browse/YARN-2076
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Chen He
>Assignee: Chen He
>Priority: Minor
>  Labels: test
> Attachments: YARN-2076.patch
>
>
> "numNodes" should be 2 instead of 3 in testReservationExchange() since only 
> two nodes are defined.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2506) TimelineClient should NOT be in yarn-common project

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692527#comment-14692527
 ] 

Sangjin Lee commented on YARN-2506:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> TimelineClient should NOT be in yarn-common project
> ---
>
> Key: YARN-2506
> URL: https://issues.apache.org/jira/browse/YARN-2506
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
>Priority: Critical
>
> YARN-2298 incorrectly moved TimelineClient to yarn-common project. It doesn't 
> belong there, we should move it back to yarn-client module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692524#comment-14692524
 ] 

Hadoop QA commented on YARN-3045:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  16m 17s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 9 new or modified test files. |
| {color:red}-1{color} | javac |   7m 55s | The applied patch generated  3  
additional warning messages. |
| {color:green}+1{color} | javadoc |   9m 53s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 48s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  8s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   2m 46s | The patch appears to introduce 5 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   8m  6s | Tests passed in 
hadoop-yarn-applications-distributedshell. |
| {color:red}-1{color} | yarn tests |   6m  4s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| {color:green}+1{color} | yarn tests |   1m 22s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | |  55m 53s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-nodemanager |
| Failed unit tests | hadoop.yarn.server.nodemanager.TestDeletionService |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12749943/YARN-3045-YARN-2928.009.patch
 |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | YARN-2928 / 07433c2 |
| javac | 
https://builds.apache.org/job/PreCommit-YARN-Build/8826/artifact/patchprocess/diffJavacWarnings.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8826/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
 |
| hadoop-yarn-applications-distributedshell test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8826/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8826/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8826/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8826/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8826/console |


This message was automatically generated.

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045-YARN-2928.005.patch, YARN-3045-YARN-2928.006.patch, 
> YARN-3045-YARN-2928.007.patch, YARN-3045-YARN-2928.008.patch, 
> YARN-3045-YARN-2928.009.patch, YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2457) FairScheduler: Handle preemption to help starved parent queues

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692528#comment-14692528
 ] 

Sangjin Lee commented on YARN-2457:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> FairScheduler: Handle preemption to help starved parent queues
> --
>
> Key: YARN-2457
> URL: https://issues.apache.org/jira/browse/YARN-2457
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> YARN-2395/YARN-2394 add preemption timeout and threshold per queue, but don't 
> check for parent queue starvation. 
> We need to check that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2657) MiniYARNCluster to (optionally) add MicroZookeeper service

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692525#comment-14692525
 ] 

Sangjin Lee commented on YARN-2657:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> MiniYARNCluster to (optionally) add MicroZookeeper service
> --
>
> Key: YARN-2657
> URL: https://issues.apache.org/jira/browse/YARN-2657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2567-001.patch, YARN-2657-002.patch
>
>
> This is needed for testing things like YARN-2646: add an option for the 
> {{MiniYarnCluster}} to start a {{MicroZookeeperService}}.
> This is just another YARN service to create and track the lifecycle. The 
> {{MicroZookeeperService}} publishes its binding information for direct takeup 
> by the registry services...this can address in-VM race conditions.
> The default setting for this service is "off"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692526#comment-14692526
 ] 

Sangjin Lee commented on YARN-2599:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2746) YARNDelegationTokenID misses serializing version from the common abstract ID

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692523#comment-14692523
 ] 

Sangjin Lee commented on YARN-2746:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> YARNDelegationTokenID misses serializing version from the common abstract ID
> 
>
> Key: YARN-2746
> URL: https://issues.apache.org/jira/browse/YARN-2746
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jian He
>
> I found this during review of YARN-2743.
> bq. AbstractDTId had a version, we dropped that in the protobuf 
> serialization. We should just write it during the serialization and read it 
> back?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692521#comment-14692521
 ] 

Sangjin Lee commented on YARN-2859:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
> --
>
> Key: YARN-2859
> URL: https://issues.apache.org/jira/browse/YARN-2859
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Hitesh Shah
>Assignee: Zhijie Shen
>Priority: Critical
>  Labels: 2.6.1-candidate
>
> In mini cluster, a random port should be used. 
> Also, the config is not updated to the host that the process got bound to.
> {code}
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster 
> (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer 
> address: localhost:10200
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster 
> (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer 
> web address: 0.0.0.0:8188
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3478) FairScheduler page not performed because different enum of YarnApplicationState and RMAppState

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692520#comment-14692520
 ] 

Sangjin Lee commented on YARN-3478:
---

Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me 
know.

> FairScheduler page not performed because different enum of 
> YarnApplicationState and RMAppState 
> ---
>
> Key: YARN-3478
> URL: https://issues.apache.org/jira/browse/YARN-3478
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.0
>Reporter: Xu Chen
> Attachments: YARN-3478.1.patch, YARN-3478.2.patch, YARN-3478.3.patch, 
> screenshot-1.png
>
>
> Got exception from log 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> at 
> com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1225)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.DynamicUserWebFilter$DynamicUserFilter.doFilter(DynamicUserWebFilter.java:59)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint

[jira] [Commented] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692518#comment-14692518
 ] 

Hadoop QA commented on YARN-4046:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m  7s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 59s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 52s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m  9s | The applied patch generated  3 
new checkstyle issues (total was 97, now 99). |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 21s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 57s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  22m 42s | Tests failed in 
hadoop-common. |
| | |  63m  5s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.ha.TestZKFailoverController |
|   | hadoop.net.TestNetUtils |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12749949/YARN-4096.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 7c796fd |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/diffcheckstylehadoop-common.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/whitespace.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8825/artifact/patchprocess/testrun_hadoop-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8825/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8825/console |


This message was automatically generated.

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4096.001.patch
>
>
> On a debian machine we have seen node manager recovery of containers fail 
> because the signal syntax for process group may not work. We see errors in 
> checking if process is alive during container recovery which causes the 
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error. The attempts are not retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4025) Deal with byte representations of Longs in writer code

2015-08-11 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692511#comment-14692511
 ] 

Vrushali C commented on YARN-4025:
--

Yes, +1 

> Deal with byte representations of Longs in writer code
> --
>
> Key: YARN-4025
> URL: https://issues.apache.org/jira/browse/YARN-4025
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-4025-YARN-2928.001.patch
>
>
> Timestamps are being stored as Longs in HBase by the HBaseTimelineWriterImpl 
> code. There are some places in the code that convert from Long to byte[] to 
> String for easier argument passing between function calls, and these values 
> then end up being converted back to byte[] while storing in HBase. 
> It would be better to pass around byte[] or the Longs themselves as 
> applicable. 
> This may result in some api changes (store function) as well in adding a few 
> more function calls like getColumnQualifier which accepts a pre-encoded byte 
> array. It will be in addition to the existing api which accepts a String and 
> the ColumnHelper to return a byte[] column name instead of a String one. 
> Filing jira to track these changes.
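
For illustration, the difference between the two encodings, using the HBase Bytes
utility; this is a standalone sketch, not the writer code itself.

{code}
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a Long round-tripped through String is not the same bytes as the
// fixed-width encoding typically used for numeric HBase column values.
class TimestampEncodingExample {
  public static void main(String[] args) {
    long ts = 1439244348718L;

    byte[] viaString = Bytes.toBytes(String.valueOf(ts)); // 13 bytes of ASCII
    byte[] asLong = Bytes.toBytes(ts);                    // 8-byte big-endian

    System.out.println("string-encoded length: " + viaString.length);
    System.out.println("long-encoded length:   " + asLong.length);
    System.out.println("round-trip value:      " + Bytes.toLong(asLong));
  }
}
{code}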



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3906) split the application table from the entity table

2015-08-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692476#comment-14692476
 ] 

Junping Du commented on YARN-3906:
--

Ok. Committing this patch now.

> split the application table from the entity table
> -
>
> Key: YARN-3906
> URL: https://issues.apache.org/jira/browse/YARN-3906
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-3906-YARN-2928.001.patch, 
> YARN-3906-YARN-2928.002.patch, YARN-3906-YARN-2928.003.patch, 
> YARN-3906-YARN-2928.004.patch, YARN-3906-YARN-2928.005.patch, 
> YARN-3906-YARN-2928.006.patch, YARN-3906-YARN-2928.007.patch
>
>
> Per discussions on YARN-3815, we need to split the application entities from 
> the main entity table into its own table (application).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4025) Deal with byte representations of Longs in writer code

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692468#comment-14692468
 ] 

Sangjin Lee commented on YARN-4025:
---

For the record, we will go ahead with YARN-3906 first. We'll need to update 
this patch to reflect the changes in YARN-3906. I'll work with [~vrushalic] on 
that.

> Deal with byte representations of Longs in writer code
> --
>
> Key: YARN-4025
> URL: https://issues.apache.org/jira/browse/YARN-4025
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-4025-YARN-2928.001.patch
>
>
> Timestamps are being stored as Longs in HBase by the HBaseTimelineWriterImpl 
> code. There are some places in the code that convert from Long to byte[] to 
> String for easier argument passing between function calls, and these values 
> then end up being converted back to byte[] while storing in HBase. 
> It would be better to pass around byte[] or the Longs themselves as 
> applicable. 
> This may result in some api changes (store function) as well in adding a few 
> more function calls like getColumnQualifier which accepts a pre-encoded byte 
> array. It will be in addition to the existing api which accepts a String and 
> the ColumnHelper to return a byte[] column name instead of a String one. 
> Filing jira to track these changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3906) split the application table from the entity table

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692465#comment-14692465
 ] 

Sangjin Lee commented on YARN-3906:
---

I checked with [~vrushalic], and we decided to put the patch for this JIRA 
(YARN-3906) first.

> split the application table from the entity table
> -
>
> Key: YARN-3906
> URL: https://issues.apache.org/jira/browse/YARN-3906
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-3906-YARN-2928.001.patch, 
> YARN-3906-YARN-2928.002.patch, YARN-3906-YARN-2928.003.patch, 
> YARN-3906-YARN-2928.004.patch, YARN-3906-YARN-2928.005.patch, 
> YARN-3906-YARN-2928.006.patch, YARN-3906-YARN-2928.007.patch
>
>
> Per discussions on YARN-3815, we need to split the application entities from 
> the main entity table into its own table (application).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4047) ClientRMService getApplications has high scheduler lock contention

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692412#comment-14692412
 ] 

Hadoop QA commented on YARN-4047:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 47s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   8m  1s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 59s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 50s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 22s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 27s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  53m 27s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  92m 53s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
|   | hadoop.yarn.server.resourcemanager.TestRMAdminService |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12749935/YARN-4047.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 7c796fd |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8824/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8824/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8824/console |


This message was automatically generated.

> ClientRMService getApplications has high scheduler lock contention
> --
>
> Key: YARN-4047
> URL: https://issues.apache.org/jira/browse/YARN-4047
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>  Labels: 2.6.1-candidate
> Attachments: YARN-4047.001.patch
>
>
> The getApplications call can be particularly expensive because the code can 
> call checkAccess on every application being tracked by the RM.  checkAccess 
> will often call scheduler.checkAccess which will grab the big scheduler lock. 
>  This can cause a lot of contention with the scheduler thread which is busy 
> trying to process node heartbeats, app allocation requests, etc.
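
One way to picture the usual mitigation: apply the cheap request filters first, and
take the ACL check that may contend on the scheduler lock only for the survivors.
This is a sketch of the idea, with placeholder types, not the attached patch.

{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;

// Sketch: inexpensive filters (state, queue, user, app type) run before the
// per-application access check, so the scheduler lock is taken far less often.
class GetApplicationsSketch {
  interface AppReport { String getQueue(); String getState(); }

  static List<AppReport> getApplications(Collection<AppReport> apps,
                                          Predicate<AppReport> cheapFilters,
                                          Predicate<AppReport> checkAccess) {
    List<AppReport> result = new ArrayList<>();
    for (AppReport app : apps) {
      if (!cheapFilters.test(app)) {
        continue;                    // no lock needed to reject these
      }
      if (checkAccess.test(app)) {   // may contend on the scheduler lock
        result.add(app);
      }
    }
    return result;
  }
}
{code}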



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14687395#comment-14687395
 ] 

Anubhav Dhoot commented on YARN-4046:
-

[~cnauroth] appreciate your review

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4096.001.patch
>
>
> On a debian machine we have seen node manager recovery of containers fail 
> because the signal syntax for process group may not work. We see errors in 
> checking if process is alive during container recovery which causes the 
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4026) FiCaSchedulerApp: ContainerAllocator should be able to choose how to order pending resource requests

2015-08-11 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14687398#comment-14687398
 ] 

Jian He commented on YARN-4026:
---

- why "assignment.setFulfilledReservation(true);" is called in Reserved state ?
{code}
  if (result.getAllocationState() == AllocationState.RESERVED) {
// This is a reserved container
LOG.info("Reserved container " + " application="
+ application.getApplicationId() + " resource=" + allocatedResource
+ " queue=" + this.toString() + " cluster=" + clusterResource);
assignment.getAssignmentInformation().addReservationDetails(
updatedContainer.getId(),
application.getCSLeafQueue().getQueuePath());
assignment.getAssignmentInformation().incrReservations();
Resources.addTo(assignment.getAssignmentInformation().getReserved(),
allocatedResource);
assignment.setFulfilledReservation(true);
  } else {

{code}
-  I think this can always return ContainerAllocation.LOCALITY_SKIPPED, since 
the semantics of this method is to try to allocate a container for a certain 
locality. 
{code}
  return type == NodeType.OFF_SWITCH ? ContainerAllocation.APP_SKIPPED
  : ContainerAllocation.LOCALITY_SKIPPED;
{code}
The caller here can choose to return APP_SKIPPED if it sees LOCALITY_SKIPPED:
{code}
  assigned =
  assignOffSwitchContainers(clusterResource, offSwitchResourceRequest,
  node, priority, reservedContainer, schedulingMode,
  currentResoureLimits);
  assigned.requestNodeType = requestType;

  return assigned;
}
{code}
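
In other words, a sketch of the suggested shape, mirroring the snippet above; this
is not patch code, and it assumes AllocationState carries a LOCALITY_SKIPPED value.

{code}
// Sketch only: the locality-level method always reports LOCALITY_SKIPPED, and
// the caller decides that a skipped OFF_SWITCH request means the app is skipped.
ContainerAllocation assigned =
    assignOffSwitchContainers(clusterResource, offSwitchResourceRequest,
        node, priority, reservedContainer, schedulingMode,
        currentResoureLimits);
if (type == NodeType.OFF_SWITCH
    && assigned.getAllocationState() == AllocationState.LOCALITY_SKIPPED) {
  assigned = ContainerAllocation.APP_SKIPPED;
}
assigned.requestNodeType = requestType;
return assigned;
{code}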


> FiCaSchedulerApp: ContainerAllocator should be able to choose how to order 
> pending resource requests
> 
>
> Key: YARN-4026
> URL: https://issues.apache.org/jira/browse/YARN-4026
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4026.1.patch
>
>
> After YARN-3983, we have an extensible ContainerAllocator which can be used 
> by FiCaSchedulerApp to decide how to allocate resources.
> While working on YARN-1651 (allocate resource to increase container), I found 
> one thing in existing logic not flexible enough:
> - ContainerAllocator decides what to allocate for a given node and priority. 
> To support different kinds of resource allocation (for example, treating 
> priority as a weight, or choosing whether to skip a priority), it's better to 
> let ContainerAllocator choose how to order pending resource requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4046:

Description: 
On a debian machine we have seen node manager recovery of containers fail 
because the signal syntax for process group may not work. We see errors in 
checking if process is alive during container recovery which causes the 
container to be declared as LOST (154) on a NodeManager restart.

The application will fail with error. The attempts are not retried.
{noformat}
Application application_1439244348718_0001 failed 1 times due to Attempt 
recovered after RM restartAM Container for appattempt_1439244348718_0001_01 
exited with exitCode: 154
{noformat}


  was:
On a debian machine we have seen node manager recovery of containers fail 
because the signal syntax for process group may not work. We see errors in 
checking if process is alive during container recovery which causes the 
container to be declared as LOST (154) on a NodeManager restart.

The application will fail with error
{noformat}
Application application_1439244348718_0001 failed 1 times due to Attempt 
recovered after RM restartAM Container for appattempt_1439244348718_0001_01 
exited with exitCode: 154
{noformat}



> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4096.001.patch
>
>
> On a debian machine we have seen node manager recovery of containers fail 
> because the signal syntax for process group may not work. We see errors in 
> checking if process is alive during container recovery which causes the 
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error. The attempts are not retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4046:

Attachment: YARN-4096.001.patch

Attaching a patch that prefixes "--" when using a negative pid for kill.

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4096.001.patch
>
>
> On a debian machine we have seen node manager recovery of containers fail 
> because the signal syntax for process group may not work. We see errors in 
> checking if process is alive during container recovery which causes the 
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4045) Negative avaialbleMB is being reported for root queue.

2015-08-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14687389#comment-14687389
 ] 

Wangda Tan commented on YARN-4045:
--

[~tgraves]/[~shahrs87], 

I think this case can happen when container reservation interacts with a node 
disconnect; one example is:
{code}
A cluster has 6 nodes, each with 20G of resource. Usage is:
N1-N4 are fully used.
N5-N6 each have 10G used.
An app asks for a 15G container; assume it is reserved on N5, so total used resource = 
20G * 4 + 10G * 2 + 15G (just reserved) = 115G.
Then N6 disconnects: cluster resource becomes 100G, while used resource = 105G.
{code}

I've just checked: YARN-3361 doesn't include a related fix, and we currently don't 
have a fix for the above corner case.

Another problem is caused by DRC: since 2.7.1 we have set availableResource = 
max(availableResource, Resources.none()).
{code}
childQueue.getMetrics().setAvailableResourcesToQueue(
    Resources.max(calculator, clusterResource, available, Resources.none()));
{code}

But when using DRC, if a resource has availableMB < 0 and availableVCores > 0, it 
can still compare as greater than Resources.none(), so the negative memory value is 
kept. We may need to fix this case as well.
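
To illustrate the corner case, here is a self-contained sketch with a simplified Resource type and made-up cluster totals (not the Hadoop classes): a dominant-share comparison against zero can keep a negative memory value because the vCores component dominates, while a component-wise clamp cannot.
{code}
// Sketch only: simplified types and hypothetical cluster totals, not the
// actual Resources/DominantResourceCalculator implementation.
public class ClampSketch {
  static class Resource {
    final long memoryMB; final int vCores;
    Resource(long memoryMB, int vCores) { this.memoryMB = memoryMB; this.vCores = vCores; }
    @Override public String toString() { return "<memory:" + memoryMB + ", vCores:" + vCores + ">"; }
  }

  // Rough stand-in for Resources.max(drc, clusterResource, available, none).
  static Resource drcMaxWithZero(Resource r, long clusterMB, int clusterVCores) {
    double dominantShare = Math.max((double) r.memoryMB / clusterMB,
                                    (double) r.vCores / clusterVCores);
    return dominantShare > 0 ? r : new Resource(0, 0);
  }

  // Possible alternative sketched here: clamp each component independently.
  static Resource componentwiseMaxWithZero(Resource r) {
    return new Resource(Math.max(r.memoryMB, 0), Math.max(r.vCores, 0));
  }

  public static void main(String[] args) {
    Resource available = new Resource(-163328, 5);
    System.out.println(drcMaxWithZero(available, 1000000L, 1000)); // keeps memory:-163328
    System.out.println(componentwiseMaxWithZero(available));       // memory clamped to 0
  }
}
{code}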

Thoughts?

> Negative avaialbleMB is being reported for root queue.
> --
>
> Key: YARN-4045
> URL: https://issues.apache.org/jira/browse/YARN-4045
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Rushabh S Shah
>
> We recently deployed 2.7 in one of our cluster.
> We are seeing negative availableMB being reported for queue=root.
> This is from the jmx output:
> {noformat}
> 
> ...
> -163328
> ...
> 
> {noformat}
> The following is the RM log:
> {noformat}
> 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue

[jira] [Commented] (YARN-3999) RM hangs on draing events

2015-08-11 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682496#comment-14682496
 ] 

Xuan Gong commented on YARN-3999:
-

+1 lgtm. Will commit later if there are no other comments

> RM hangs on draing events
> -
>
> Key: YARN-3999
> URL: https://issues.apache.org/jira/browse/YARN-3999
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, 
> YARN-3999.3.patch, YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, 
> YARN-3999.patch
>
>
> If external systems like ATS, or ZK becomes very slow, draining all the 
> events take a lot of time. If this time becomes larger than 10 mins, all 
> applications will expire. Fixes include:
> 1. add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move ATS service out from RM active service so that RM doesn't need to 
> wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that RM is stopping/transitioning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-08-11 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3045:

Attachment: YARN-3045-YARN-2928.009.patch

Hi [~djp],
Attaching a new patch resolving your comments. I have also modified one approach: 
for cases where we publish the timeline entities directly (not through wrapped 
application or container events), such as ContainerMetrics, I have added a new 
NMTimelineEvent that accepts the TimelineEntity and ApplicationId. This approach 
avoids creating new event classes and only requires exposing a method in 
NMTimelinePublisher.
I have also fixed the test case failures, but the javac warnings do not seem to be 
related to my modifications, and findbugs did not report any issue. I will check 
again in the next Jenkins run.
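
A rough sketch of the shape described above (hypothetical names and stand-in types; the actual patch may differ): one generic event that carries an already-built entity plus the application it belongs to.
{code}
// Sketch only: stand-in types, not the real ATS v2 / NM classes.
public class NMTimelineEventSketch {
  static class TimelineEntity { }   // stand-in for the timeline entity
  static class ApplicationId { }    // stand-in for the YARN application id

  static class NMTimelineEvent {
    private final TimelineEntity entity;
    private final ApplicationId appId;

    NMTimelineEvent(TimelineEntity entity, ApplicationId appId) {
      this.entity = entity;
      this.appId = appId;
    }
    TimelineEntity getEntity() { return entity; }
    ApplicationId getApplicationId() { return appId; }
  }

  // Publishers such as ContainerMetrics only need this one entry point instead
  // of a dedicated event class each.
  static NMTimelineEvent publishTimelineEntity(TimelineEntity entity, ApplicationId appId) {
    return new NMTimelineEvent(entity, appId);  // would be handed to the NM dispatcher
  }
}
{code}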

> [Event producers] Implement NM writing container lifecycle events to ATS
> 
>
> Key: YARN-3045
> URL: https://issues.apache.org/jira/browse/YARN-3045
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3045-YARN-2928.002.patch, 
> YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
> YARN-3045-YARN-2928.005.patch, YARN-3045-YARN-2928.006.patch, 
> YARN-3045-YARN-2928.007.patch, YARN-3045-YARN-2928.008.patch, 
> YARN-3045-YARN-2928.009.patch, YARN-3045.20150420-1.patch
>
>
> Per design in YARN-2928, implement NM writing container lifecycle events and 
> container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4047) ClientRMService getApplications has high scheduler lock contention

2015-08-11 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated YARN-4047:
--
Labels: 2.6.1-candidate  (was: )

> ClientRMService getApplications has high scheduler lock contention
> --
>
> Key: YARN-4047
> URL: https://issues.apache.org/jira/browse/YARN-4047
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>  Labels: 2.6.1-candidate
> Attachments: YARN-4047.001.patch
>
>
> The getApplications call can be particuarly expensive because the code can 
> call checkAccess on every application being tracked by the RM.  checkAccess 
> will often call scheduler.checkAccess which will grab the big scheduler lock. 
>  This can cause a lot of contention with the scheduler thread which is busy 
> trying to process node heartbeats, app allocation requests, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4047) ClientRMService getApplications has high scheduler lock contention

2015-08-11 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-4047:
-
Attachment: YARN-4047.001.patch

Patch that performs the checkAccess filter last rather than first.

> ClientRMService getApplications has high scheduler lock contention
> --
>
> Key: YARN-4047
> URL: https://issues.apache.org/jira/browse/YARN-4047
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-4047.001.patch
>
>
> The getApplications call can be particuarly expensive because the code can 
> call checkAccess on every application being tracked by the RM.  checkAccess 
> will often call scheduler.checkAccess which will grab the big scheduler lock. 
>  This can cause a lot of contention with the scheduler thread which is busy 
> trying to process node heartbeats, app allocation requests, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended

2015-08-11 Thread Dustin Cote (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682379#comment-14682379
 ] 

Dustin Cote commented on YARN-2369:
---

[~jlowe] thanks for all the input.  I'll clean this latest patch up based on 
these comments this week.

Happy to throw this in the MAPREDUCE project instead as well, since basically 
all the changes are in the MR client.  I don't think sub JIRAs would be 
necessary since it's a pretty small change on the YARN side, but I leave that 
to the project management experts.  I don't see any organizational problem 
keeping it all in one JIRA here.  

> Environment variable handling assumes values should be appended
> ---
>
> Key: YARN-2369
> URL: https://issues.apache.org/jira/browse/YARN-2369
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Jason Lowe
>Assignee: Dustin Cote
> Attachments: YARN-2369-1.patch, YARN-2369-2.patch, YARN-2369-3.patch, 
> YARN-2369-4.patch, YARN-2369-5.patch, YARN-2369-6.patch
>
>
> When processing environment variables for a container context the code 
> assumes that the value should be appended to any pre-existing value in the 
> environment.  This may be desired behavior for handling path-like environment 
> variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc. but it is a 
> non-intuitive and harmful way to handle any variable that does not have 
> path-like semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4047) ClientRMService getApplications has high scheduler lock contention

2015-08-11 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-4047:


Assignee: Jason Lowe

In OOZIE-1729 Oozie started calling getApplications to look for applications 
with specific tags.  This significantly increases the utilization of this 
method on a cluster that makes heavy use of Oozie.

One quick fix for the Oozie use-case may be to swap the filter order.  Rather 
than doing the expensive checkAccess call first, we can do all the other 
filtering first and finally verify the user has access before adding the app to 
the response.  In the Oozie scenario most apps will be filtered by the tag 
check before we ever get to the checkAccess call.
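
A self-contained sketch of that reordering (simplified types, not the actual ClientRMService code): the cheap tag filter runs first and the lock-taking checkAccess runs last, so most applications never reach the expensive call.
{code}
// Sketch only: simplified application/ACL types standing in for the RM code.
import java.util.*;

public class GetApplicationsSketch {
  static class App { String user; Set<String> tags = new HashSet<String>(); }

  // Stand-in for the expensive ACL check that grabs the scheduler lock.
  static boolean checkAccess(String caller, App app) { return caller.equals(app.user); }

  static List<App> getApplications(String caller, Set<String> requestedTags, List<App> apps) {
    List<App> result = new ArrayList<App>();
    for (App app : apps) {
      // Cheap filter first: drop apps that match none of the requested tags.
      if (!requestedTags.isEmpty() && Collections.disjoint(app.tags, requestedTags)) {
        continue;
      }
      // Expensive filter last, reached only by the few surviving apps.
      if (checkAccess(caller, app)) {
        result.add(app);
      }
    }
    return result;
  }
}
{code}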

> ClientRMService getApplications has high scheduler lock contention
> --
>
> Key: YARN-4047
> URL: https://issues.apache.org/jira/browse/YARN-4047
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>
> The getApplications call can be particuarly expensive because the code can 
> call checkAccess on every application being tracked by the RM.  checkAccess 
> will often call scheduler.checkAccess which will grab the big scheduler lock. 
>  This can cause a lot of contention with the scheduler thread which is busy 
> trying to process node heartbeats, app allocation requests, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4047) ClientRMService getApplications has high scheduler lock contention

2015-08-11 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4047:


 Summary: ClientRMService getApplications has high scheduler lock 
contention
 Key: YARN-4047
 URL: https://issues.apache.org/jira/browse/YARN-4047
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jason Lowe


The getApplications call can be particuarly expensive because the code can call 
checkAccess on every application being tracked by the RM.  checkAccess will 
often call scheduler.checkAccess which will grab the big scheduler lock.  This 
can cause a lot of contention with the scheduler thread which is busy trying to 
process node heartbeats, app allocation requests, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4023) Publish Application Priority to TimelineServer

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682360#comment-14682360
 ] 

Hadoop QA commented on YARN-4023:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  24m 12s | Pre-patch trunk has 7 extant 
Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 4 new or modified test files. |
| {color:green}+1{color} | javac |   7m 40s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 34s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 57s | Site still builds. |
| {color:red}-1{color} | checkstyle |   2m 38s | The applied patch generated  1 
new checkstyle issues (total was 16, now 16). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 23s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 31s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   7m 22s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   6m 56s | Tests passed in 
hadoop-yarn-client. |
| {color:red}-1{color} | yarn tests |   1m 53s | Tests failed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   3m 13s | Tests passed in 
hadoop-yarn-server-applicationhistoryservice. |
| {color:green}+1{color} | yarn tests |   0m 24s | Tests passed in 
hadoop-yarn-server-common. |
| {color:red}-1{color} | yarn tests |  53m 22s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 123m 59s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.util.TestRackResolver |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
|   | hadoop.yarn.server.resourcemanager.TestRMAdminService |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12749303/0001-YARN-4023.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / 1fc3c77 |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-applicationhistoryservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8823/console |


This message was automatically generated.

> Publish Application Priority to TimelineServer
> --
>
> Key: YARN-4023
> URL: https://issues.apache.org/jira/browse/YARN-4023
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-4023.patch, 0001-YARN-4023.patch, 
> ApplicationPage.png, TimelineserverMainpage.png
>
>
> Publish Application priority details to Timeline Server. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682293#comment-14682293
 ] 

Anubhav Dhoot commented on YARN-4046:
-

As per the GNU/Linux 
[documentation|http://www.gnu.org/software/coreutils/manual/html_node/kill-invocation.html#kill-invocation], 
"--" may not be needed, but it appears that not all distros (Debian, for one) 
support omitting "--".
{noformat} If a negative pid argument is desired as the first one, it should be 
preceded by --. However, as a common extension to POSIX, -- is not required 
with ‘kill -signal -pid’. {noformat}
So the fix is to always prefix "--", matching the recommendation.
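
A minimal sketch of the corrected command form (hypothetical helper, not the actual YARN code path): "--" ends option parsing, so the negative pid that addresses the whole process group is not mistaken for an option on distros whose kill requires the POSIX form.
{code}
// Sketch only: illustrates "kill -0 -- -<pgid>"; the real fix lives in the
// NodeManager's shell-command construction.
import java.io.IOException;

public class ProcessGroupLivenessSketch {
  static boolean processGroupIsAlive(int pgid) throws IOException, InterruptedException {
    // Exit code 0 means the process group exists; signal 0 is never delivered.
    Process p = new ProcessBuilder("kill", "-0", "--", "-" + pgid).start();
    return p.waitFor() == 0;
  }
}
{code}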

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
>
> On a debian machine we have seen node manager recovery of containers fail 
> because the signal syntax for process group may not work. We see errors in 
> checking if process is alive during container recovery which causes the 
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682284#comment-14682284
 ] 

Anubhav Dhoot commented on YARN-4046:
-

The error in NodeManager shows 
{noformat}
2015-08-10 15:14:05,567 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch:
 Unable to recover container container_e45_1439244348718_0001_01_01
java.io.IOException: Timeout while waiting for exit code from 
container_e45_1439244348718_0001_01_01
at 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Looking under the debugger, the shell command used to check whether the container 
is alive fails because of the kill syntax: "kill -0 -20773" is rejected.
{noformat}
this = {org.apache.hadoop.util.Shell$ShellCommandExecutor@6740} "kill -0 -20773 "
builder = {java.lang.ProcessBuilder@6789} 
 command = {java.util.ArrayList@6813}  size = 3
 directory = null
 environment = null
 redirectErrorStream = false
 redirects = null
timeOutTimer = null
timeoutTimerTask = null
errReader = {java.io.BufferedReader@6830} 
inReader = {java.io.BufferedReader@6833} 
errMsg = {java.lang.StringBuffer@6836} "kill: invalid option -- '2'\n\nUsage:\n 
kill [options]  [...]\n\nOptions:\n  [...]send signal to 
every  listed\n -, -s, --signal \n
specify the  to be sent\n -l, --list=[]  list all signal names, 
or convert one to a name\n -L, --tablelist all signal names in a 
nice table\n\n -h, --help display this help and exit\n -V, --version  
output version information and exit\n\nFor more details see kill(1).\n"
errThread = {org.apache.hadoop.util.Shell$1@6839} "Thread[Thread-102,5,]"
line = null
exitCode = 1
completed = {java.util.concurrent.atomic.AtomicBoolean@6806} "true"
{noformat}

This causes DefaultContainerExecutor#containerIsAlive to catch the 
ExitCodeException thrown by ShellCommandExecutor.execute, making it assume the 
container is lost.

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
>
> On a debian machine we have seen node manager recovery of containers fail 
> because the signal syntax for process group may not work. We see errors in 
> checking if process is alive during container recovery which causes the 
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4046) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

2015-08-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4046:

Summary: Applications fail on NM restart on some linux distro because NM 
container recovery declares AM container as LOST  (was: NM container recovery 
is broken on some linux distro because of syntax of signal)

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> 
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
>
> On a debian machine we have seen node manager recovery of containers fail 
> because the signal syntax for process group may not work. We see errors in 
> checking if process is alive during container recovery which causes the 
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_01 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4046) NM container recovery is broken on some linux distro because of syntax of signal

2015-08-11 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4046:
---

 Summary: NM container recovery is broken on some linux distro 
because of syntax of signal
 Key: YARN-4046
 URL: https://issues.apache.org/jira/browse/YARN-4046
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Critical


On a debian machine we have seen node manager recovery of containers fail 
because the signal syntax for process group may not work. We see errors in 
checking if process is alive during container recovery which causes the 
container to be declared as LOST (154) on a NodeManager restart.

The application will fail with error
{noformat}
Application application_1439244348718_0001 failed 1 times due to Attempt 
recovered after RM restartAM Container for appattempt_1439244348718_0001_01 
exited with exitCode: 154
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-08-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682240#comment-14682240
 ] 

Rohith Sharma K S commented on YARN-3979:
-

I had a look at the shared RM logs, and I strongly suspect it is happening for the 
same reason as YARN-3990.
From the shared log, I see the entries below, which indicate that the 
AsyncDispatcher is overloaded with unnecessary events. Maybe you can apply the 
patch from YARN-3990 and test it.
{noformat}
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
BJHC-HERA-18352.hadoop.jd.local:50086 Node Transitioned from RUNNING to LOST
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 
BJHC-HADOOP-HERA-17280.jd.local to /rack/rack4065
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2515000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2515000
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node BJHC-HADOOP-HERA-17280.jd.local(cmPort: 50086 httpPort: 
8042) registered with capability: , assigned nodeId 
BJHC-HADOOP-HERA-17280.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 
BJHC-HERA-164102.hadoop.jd.local to /rack/rack41007
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node BJHC-HERA-164102.hadoop.jd.local(cmPort: 50086 httpPort: 
8042) registered with capability: , assigned nodeId 
BJHC-HERA-164102.hadoop.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2516000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2516000
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not 
found resyncing BJHC-HERA-18043.hadoop.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2517000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2517000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2518000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2518000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2519000
{noformat}

> Am in ResourceLocalizationService hang 10 min cause RM kill  AM
> ---
>
> Key: YARN-3979
> URL: https://issues.apache.org/jira/browse/YARN-3979
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.2.0
> Environment: CentOS 6.5  Hadoop-2.2.0
>Reporter: zhangyubiao
> Attachments: ERROR103.log
>
>
> 2015-07-27 02:46:17,348 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1437735375558
> _104282_01_01
> 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE)
> 2015-07-27 02:56:18,510 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for appattempt_1437735375558_104282_0
> 1 (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.api.ContainerManagementProtocolPB



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4045) Negative avaialbleMB is being reported for root queue.

2015-08-11 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682118#comment-14682118
 ] 

Rushabh S Shah commented on YARN-4045:
--

bq. Thanks Rushabh S Shah for reporting this. One doubt, Which 
ResourceCalculator is used here? Is it Dominant RC.
yes.

> Negative avaialbleMB is being reported for root queue.
> --
>
> Key: YARN-4045
> URL: https://issues.apache.org/jira/browse/YARN-4045
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Rushabh S Shah
>
> We recently deployed 2.7 in one of our cluster.
> We are seeing negative availableMB being reported for queue=root.
> This is from the jmx output:
> {noformat}
> 
> ...
> -163328
> ...
> 
> {noformat}
> The following is the RM log:
> {noformat}
> 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,486 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:44,487 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:47,401 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root used

[jira] [Commented] (YARN-4045) Negative avaialbleMB is being reported for root queue.

2015-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682115#comment-14682115
 ] 

Thomas Graves commented on YARN-4045:
-

I remember seeing that this was fixed in branch-2 by some of the capacity 
scheduler work for labels.

I thought this might be fixed by 
https://issues.apache.org/jira/browse/YARN-3243, but that is already included.

This might be fixed as part of https://issues.apache.org/jira/browse/YARN-3361, 
which is probably too big to backport in its entirety.

[~leftnoteasy], do you remember this issue?

Note that it also shows up in the capacity scheduler UI as the root queue going 
over 100%. I remember that when I was testing YARN-3434 it wasn't occurring for me 
on branch-2 (2.8), and I thought it was one of the above JIRAs that fixed it.

> Negative avaialbleMB is being reported for root queue.
> --
>
> Key: YARN-4045
> URL: https://issues.apache.org/jira/browse/YARN-4045
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Rushabh S Shah
>
> We recently deployed 2.7 in one of our cluster.
> We are seeing negative availableMB being reported for queue=root.
> This is from the jmx output:
> {noformat}
> 
> ...
> -163328
> ...
> 
> {noformat}
> The following is the RM log:
> {noformat}
> 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,486 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:44,487 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,886 [

[jira] [Commented] (YARN-3999) RM hangs on draing events

2015-08-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682112#comment-14682112
 ] 

Rohith Sharma K S commented on YARN-3999:
-

Thanks [~jianhe] for the explanation. Overall the patch looks good to me.

> RM hangs on draing events
> -
>
> Key: YARN-3999
> URL: https://issues.apache.org/jira/browse/YARN-3999
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, 
> YARN-3999.3.patch, YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, 
> YARN-3999.patch
>
>
> If external systems like ATS, or ZK becomes very slow, draining all the 
> events take a lot of time. If this time becomes larger than 10 mins, all 
> applications will expire. Fixes include:
> 1. add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move ATS service out from RM active service so that RM doesn't need to 
> wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that RM is stopping/transitioning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4045) Negative avaialbleMB is being reported for root queue.

2015-08-11 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682110#comment-14682110
 ] 

Sunil G commented on YARN-4045:
---

Thanks [~shahrs87] for reporting this. One doubt: which ResourceCalculator is 
used here? Is it the Dominant RC?

> Negative avaialbleMB is being reported for root queue.
> --
>
> Key: YARN-4045
> URL: https://issues.apache.org/jira/browse/YARN-4045
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Rushabh S Shah
>
> We recently deployed 2.7 in one of our cluster.
> We are seeing negative availableMB being reported for queue=root.
> This is from the jmx output:
> {noformat}
> 
> ...
> -163328
> ...
> 
> {noformat}
> The following is the RM log:
> {noformat}
> 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,486 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:44,487 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:47,401 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> abso

[jira] [Commented] (YARN-3906) split the application table from the entity table

2015-08-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682102#comment-14682102
 ] 

Junping Du commented on YARN-3906:
--

Thanks [~sjlee0] for the patch work and [~gtCarrera9] for the review! The latest 
patch LGTM. However, I will wait for our decision on the sequencing of YARN-4025.

> split the application table from the entity table
> -
>
> Key: YARN-3906
> URL: https://issues.apache.org/jira/browse/YARN-3906
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-3906-YARN-2928.001.patch, 
> YARN-3906-YARN-2928.002.patch, YARN-3906-YARN-2928.003.patch, 
> YARN-3906-YARN-2928.004.patch, YARN-3906-YARN-2928.005.patch, 
> YARN-3906-YARN-2928.006.patch, YARN-3906-YARN-2928.007.patch
>
>
> Per discussions on YARN-3815, we need to split the application entities from 
> the main entity table into its own table (application).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3924) Submitting an application to standby ResourceManager should respond better than Connection Refused

2015-08-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682099#comment-14682099
 ] 

Rohith Sharma K S commented on YARN-3924:
-

I agree with the concern that the user should be able to get a standby exception. 
I am not sure whether this point was discussed when RM HA was initially designed. 
cc: [~ka...@cloudera.com] [~jianhe] [~xgong] [~vinodkv] for more discussion on 
this.

> Submitting an application to standby ResourceManager should respond better 
> than Connection Refused
> --
>
> Key: YARN-3924
> URL: https://issues.apache.org/jira/browse/YARN-3924
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Dustin Cote
>Assignee: Ajith S
>Priority: Minor
>
> When submitting an application directly to a standby resource manager, the 
> resource manager responds with 'Connection Refused' rather than indicating 
> that it is a standby resource manager.  Because the resource manager is aware 
> of its own state, I feel like we can have the 8032 port open for standby 
> resource managers and reject the request with something like 'Cannot process 
> application submission from this standby resource manager'.  
> This would be especially helpful for debugging oozie problems when users put 
> in the wrong address for the 'jobtracker' (i.e. they don't put the logical RM 
> address but rather point to a specific resource manager).  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4045) Negative avaialbleMB is being reported for root queue.

2015-08-11 Thread Rushabh S Shah (JIRA)
Rushabh S Shah created YARN-4045:


 Summary: Negative avaialbleMB is being reported for root queue.
 Key: YARN-4045
 URL: https://issues.apache.org/jira/browse/YARN-4045
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Rushabh S Shah


We recently deployed 2.7 in one of our cluster.
We are seeing negative availableMB being reported for queue=root.
This is from the jmx output:
{noformat}

...
-163328
...

{noformat}

The following is the RM log:
{noformat}
2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:44,486 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:44,487 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=
2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
absoluteUsedCapacity=1.0032743 used= 
cluster=
2015-08-10 14:42:47,401 [ResourceManager Event Processor] INFO 
capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
absoluteUsedCapacity=1.0029854 used= 
cluster=

{noformat}

bq.  used= cluster=
For the root queue, usedCapacity is more than the total capacity.
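
For illustration only (hypothetical cluster size; the real totals are not visible in the stripped log lines above), the arithmetic behind the negative value is simply:
{code}
// Sketch only: made-up totalMB; usedCapacity taken from the RM log above.
public class NegativeAvailableSketch {
  public static void main(String[] args) {
    long totalMB = 1000000L;           // hypothetical cluster memory
    double usedCapacity = 1.0032743;   // root queue used capacity from the log
    long usedMB = (long) (totalMB * usedCapacity);
    System.out.println("availableMB = " + (totalMB - usedMB));  // negative once used > total
  }
}
{code}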





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state

2015-08-11 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682082#comment-14682082
 ] 

Sunil G commented on YARN-3212:
---

Hi [~djp]
I have one doubt on this. For {{StatusUpdateWhenHealthyTransition}}, if the node's 
state is DECOMMISSIONING at the initial state, we now move it to DECOMMISSIONED 
directly. 
Could we give it a chance to move to UNHEALTHY here, so that after some rounds we 
can mark it DECOMMISSIONED if it cannot be revived? Your thoughts?

> RMNode State Transition Update with DECOMMISSIONING state
> -
>
> Key: YARN-3212
> URL: https://issues.apache.org/jira/browse/YARN-3212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, 
> YARN-3212-v2.patch, YARN-3212-v3.patch, YARN-3212-v4.1.patch, 
> YARN-3212-v4.patch, YARN-3212-v5.1.patch, YARN-3212-v5.patch
>
>
> As proposed in YARN-914, a new state of “DECOMMISSIONING” will be added and 
> can transition from “running” state triggered by a new event - 
> “decommissioning”. 
> This new state can be transit to state of “decommissioned” when 
> Resource_Update if no running apps on this NM or NM reconnect after restart. 
> Or it received DECOMMISSIONED event (after timeout from CLI).
> In addition, it can back to “running” if user decides to cancel previous 
> decommission by calling recommission on the same node. The reaction to other 
> events is similar to RUNNING state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

