[jira] [Commented] (YARN-4795) ContainerMetrics drops records

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202571#comment-15202571
 ] 

Hadoop QA commented on YARN-4795:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
47s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
51s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
24s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 23s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 9s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
20s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 34m 50s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:fbe3e86 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12793202/YARN-4795.001.patch |
| JIRA Issue | YARN-4795 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 6bf5bc037e48 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 33239c9 |
| Default Java | 1.7.0_95 |
| Multi-JDK versi

[jira] [Commented] (YARN-4815) ATS 1.5 timelineclinet impl try to create attempt directory for every event call

2016-03-18 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198733#comment-15198733
 ] 

Li Lu commented on YARN-4815:
-

Fine... My concern is that we do not need to have separate caches for each use 
case if they can be modeled with Guava. I'm fine with either way. 
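
As a rough illustration of the caching idea (not the actual patch; the class 
name, size bound, and expiry below are assumptions), a Guava LoadingCache keyed 
by attempt id would ensure the directory-creation call happens at most once per 
attempt:

{code:title=AttemptDirCache.java (hypothetical sketch)}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

/** Hypothetical helper: creates each attempt directory at most once. */
class AttemptDirCache {
  private final LoadingCache<String, Path> cache;

  AttemptDirCache(final FileSystem fs, final Path baseDir) {
    this.cache = CacheBuilder.newBuilder()
        .maximumSize(1000)                        // assumed bound, not from the patch
        .expireAfterAccess(10, TimeUnit.MINUTES)  // assumed expiry, not from the patch
        .build(new CacheLoader<String, Path>() {
          @Override
          public Path load(String attemptId) throws Exception {
            Path dir = new Path(baseDir, attemptId);
            fs.mkdirs(dir);  // only reached on a cache miss
            return dir;
          }
        });
  }

  Path getAttemptDir(String attemptId) throws ExecutionException {
    return cache.get(attemptId);  // creates on first access, cached afterwards
  }
}
{code}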

> ATS 1.5 timelineclinet impl try to create attempt directory for every event 
> call
> 
>
> Key: YARN-4815
> URL: https://issues.apache.org/jira/browse/YARN-4815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4815.1.patch
>
>
> In the ATS 1.5 timeline client impl, we try to create the attempt directory 
> for every event call. Since one directory-creation call per attempt is 
> enough, this causes a perf issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent

2016-03-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202592#comment-15202592
 ] 

Wangda Tan commented on YARN-4002:
--

[~rohithsharma], [~zhiguohong],
Thanks for working on this patch, generally looks good.
Do you think we need to acquire readlock in printConfiguredHosts and 
setDecomissionedNMsMetrics?

> make ResourceTrackerService.nodeHeartbeat more concurrent
> -
>
> Key: YARN-4002
> URL: https://issues.apache.org/jira/browse/YARN-4002
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Critical
> Attachments: 0001-YARN-4002.patch, YARN-4002-lockless-read.patch, 
> YARN-4002-rwlock.patch, YARN-4002-v0.patch
>
>
> We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By 
> design, the method ResourceTrackerService.nodeHeartbeat should be concurrent 
> enough to scale for large clusters.
> But we have a "BIG" lock in NodesListManager.isValidNode which I think is 
> unnecessary.
> First, the fields "includes" and "excludes" of HostsFileReader are only 
> updated on "refresh nodes". All RPC threads handling node heartbeats are only 
> readers, so an RWLock could be used to allow concurrent access by the RPC 
> threads.
> Second, since the fields "includes" and "excludes" of HostsFileReader are 
> always updated by "reference assignment", which is atomic in Java, the 
> reader-side lock could simply be skipped.
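
A minimal sketch of the two alternatives described above (field and method 
names are illustrative, not the actual HostsFileReader/NodesListManager API):

{code:title=HostLists.java (illustrative sketch)}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class HostLists {
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
  // volatile so a plain reference swap is safely visible to reader threads
  private volatile Set<String> includes = Collections.emptySet();

  // Option 1: read/write lock -- many heartbeat threads can read concurrently.
  boolean isValidWithLock(String host) {
    rwLock.readLock().lock();
    try {
      return includes.isEmpty() || includes.contains(host);
    } finally {
      rwLock.readLock().unlock();
    }
  }

  // Option 2: lock-free read -- refresh swaps in a new immutable set atomically,
  // so readers can skip locking entirely.
  boolean isValidLockFree(String host) {
    Set<String> snapshot = includes;  // single volatile read
    return snapshot.isEmpty() || snapshot.contains(host);
  }

  void refresh(Set<String> newIncludes) {
    Set<String> copy = Collections.unmodifiableSet(new HashSet<String>(newIncludes));
    rwLock.writeLock().lock();
    try {
      includes = copy;  // atomic reference assignment
    } finally {
      rwLock.writeLock().unlock();
    }
  }
}
{code}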



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-18 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201835#comment-15201835
 ] 

Varun Saxena commented on YARN-4712:


bq.  Varun Saxena, will you do the honors of committing this patch?
Sure. Will commit it shortly.

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch, 
> YARN-4712-YARN-2928.v1.004.patch, YARN-4712-YARN-2928.v1.005.patch, 
> YARN-4712-YARN-2928.v1.006.patch
>
>
> There are 2 issues with CPU usage collection:
> * I was able to observe that many times the CPU usage obtained from 
> {{pTree.getCpuUsagePercent()}} is ResourceCalculatorProcessTree.UNAVAILABLE 
> (i.e. -1), but ContainersMonitor still does the calculation 
> {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore / 
> resourceCalculatorPlugin.getNumProcessors()}}, because of which the 
> UNAVAILABLE check in {{NMTimelinePublisher.reportContainerResourceUsage}} is 
> never encountered, so proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainerMonitor publishes decimal values for the CPU usage.
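
A hedged sketch of the kind of guard being asked for; the helper below mirrors 
the calculation named in the description, but it is not the actual 
ContainersMonitor/NMTimelinePublisher code and the surrounding wiring is 
assumed:

{code:title=CpuMetric.java (illustrative sketch)}
import org.apache.hadoop.yarn.util.ResourceCalculatorPlugin;
import org.apache.hadoop.yarn.util.ResourceCalculatorProcessTree;

final class CpuMetric {
  /** Returns the total-cores CPU percentage, or UNAVAILABLE if it cannot be computed. */
  static long cpuUsageTotalCores(ResourceCalculatorProcessTree pTree,
                                 ResourceCalculatorPlugin plugin) {
    float perCore = pTree.getCpuUsagePercent();
    if (perCore == ResourceCalculatorProcessTree.UNAVAILABLE) {
      // Propagate UNAVAILABLE instead of dividing -1 by the core count, which
      // would slip past the UNAVAILABLE check downstream.
      return ResourceCalculatorProcessTree.UNAVAILABLE;
    }
    // Round to a whole number because EntityColumnPrefix.METRIC assumes a
    // LongConverter rather than decimal values.
    return (long) Math.ceil(perCore / plugin.getNumProcessors());
  }
}
{code}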



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4829) Add support for binary units

2016-03-18 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-4829:
---

 Summary: Add support for binary units
 Key: YARN-4829
 URL: https://issues.apache.org/jira/browse/YARN-4829
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev


The units conversion util should have support for binary units.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4766) NM should not aggregate logs older than the retention policy

2016-03-18 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200655#comment-15200655
 ] 

Haibo Chen commented on YARN-4766:
--

Fixed the license and checkstyle issues. The unit test failure is unrelated to 
the patch; I have created another JIRA to fix it: 
https://issues.apache.org/jira/browse/YARN-4838

> NM should not aggregate logs older than the retention policy
> 
>
> Key: YARN-4766
> URL: https://issues.apache.org/jira/browse/YARN-4766
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, nodemanager
>Reporter: Haibo Chen
>Assignee: Haibo Chen
> Attachments: yarn4766.001.patch, yarn4766.002.patch
>
>
> When log aggregation fails on the NM, the information for the attempt is 
> kept in the recovery DB. Log aggregation can fail for multiple reasons, which 
> are often related to HDFS space or permissions.
> On restart the recovery DB is read, and if an application attempt needs its 
> logs aggregated, the files are scheduled for aggregation without any checks. 
> The log files could be older than the retention limit, in which case we 
> should not aggregate them but immediately mark them for deletion from the 
> local file system.
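
A minimal sketch of the retention check being proposed, under the assumption 
that the decision reduces to comparing a log file's modification time against 
the retention window (names and the deletion mechanism are illustrative, not 
the patch itself):

{code:title=RetentionCheck.java (illustrative sketch)}
import java.io.File;
import java.util.concurrent.TimeUnit;

final class RetentionCheck {
  /** True if the recovered log file is still within the retention window. */
  static boolean shouldAggregate(File logFile, long retainSeconds, long nowMillis) {
    if (retainSeconds <= 0) {
      return true;  // retention disabled: aggregate everything, as before
    }
    long cutoff = nowMillis - TimeUnit.SECONDS.toMillis(retainSeconds);
    // Older than the retention limit: do not aggregate; the caller should
    // instead mark the file for deletion from the local filesystem.
    return logFile.lastModified() >= cutoff;
  }
}
{code}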



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write

2016-03-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197348#comment-15197348
 ] 

Jason Lowe commented on YARN-4814:
--

+1 lgtm, holding off on committing to allow others to comment.  I assume this 
was manually tested to verify the flushes are no longer seen at the filesystem 
level for every write.

> ATS 1.5 timelineclient impl call flush after every event write
> --
>
> Key: YARN-4814
> URL: https://issues.apache.org/jira/browse/YARN-4814
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-4814.1.patch, YARN-4814.2.patch
>
>
> The ATS 1.5 timeline client impl calls flush after every event write.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4576) Enhancement for tracking Blacklist in AM Launching

2016-03-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202584#comment-15202584
 ] 

Wangda Tan commented on YARN-4576:
--

Thanks [~vinodkv] for starting this discussion.

After catching up on most of the discussion in this JIRA and related JIRAs, my 
suggestions are:
1) A separate AM blacklist is unnecessary to me:
- When YARN detects *possible* failures, it should blacklist nodes *within the 
app* (from [~sjlee0]). If the AM container of an app fails on a node because of 
node-specific reasons, other containers of the app could fail for the same 
reason. But we shouldn't spread that to other apps, because different apps have 
different settings; we shouldn't do this unless we're confident enough that the 
two apps have very similar configs.
- When YARN detects fatal failures, it should blacklist the node globally by 
marking it UNHEALTHY. As 
[~djp] [commented|https://issues.apache.org/jira/browse/YARN-4576?focusedCommentId=15201559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15201559],
 we may need to fix the case where a node's state flips between HEALTHY and 
UNHEALTHY back and forth: we need to detect that and mark the node UNHEALTHY.

2) Framework-specified container blacklists should be transparent to end users.
YARN should make correct decisions to select the best places for apps, and apps 
should trust YARN's decisions. Just like the UNHEALTHY status of a node: a node 
at 90% disk utilization might be quite acceptable to some apps, but we 
shouldn't allow apps to say "I know it's risky, but I still want to schedule on 
these UNHEALTHY nodes."

3) Apps should have their own choice to set up preferred nodes, hosts, etc.
As Junping commented:
bq. We don't really give applications that freedom - where and how to launch an 
application's AM container has never been the application's business so far; 
that's why we call it out here - give applications the right to set their bar 
for AM launching.
We need this for AMs. I cannot find the original JIRA for AM resource requests, 
but I believe there's an open JIRA for this, and I think the AM should be able 
to add blacklisted nodes via ApplicationSubmissionContext.

> Enhancement for tracking Blacklist in AM Launching
> --
>
> Key: YARN-4576
> URL: https://issues.apache.org/jira/browse/YARN-4576
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: EnhancementAMLaunchingBlacklist.pdf
>
>
> Before YARN-2005, the YARN blacklist mechanism tracked bad nodes in the AM: 
> if the AM failed to launch containers on a specific node several times, the 
> AM would blacklist that node in future resource requests. This mechanism 
> works fine for normal containers. However, from our observation of several 
> clusters: if a problematic node fails to launch the AM, the RM can pick the 
> same problematic node to launch the next AM attempts again and again, which 
> causes application failure when other functional nodes are busy. In the 
> normal case, a customized health checker script cannot be so sensitive as to 
> mark a node unhealthy when only one or two container launches fail.
> After YARN-2005, we can have a BlacklistManager in each RMApp, so nodes that 
> previously failed to launch AM attempts for a specific application get 
> blacklisted. To avoid the risk of all nodes being blacklisted by the 
> BlacklistManager, a disable-failure-threshold is used to stop adding more 
> nodes to the blacklist once a certain ratio is hit.
> There are already some enhancements to this AM blacklist mechanism: 
> YARN-4284 addresses the wider case of AM container launch failures, and 
> YARN-4389 makes the configuration settings changeable per app to meet 
> app-specific requirements. However, there are still several gaps to address 
> more scenarios:
> 1. We may need a global blacklist instead of each app maintaining a separate 
> one. The reason is: an AM has a higher chance of failing if other AMs failed 
> there before. A quick example: in a busy cluster, all nodes are busy except 
> two problematic nodes, node a and node b; app1 has already submitted and 
> failed two AM attempts on a and b. app2 and other apps should wait for the 
> busy nodes rather than waste attempts on these two problematic nodes.
> 2. If AM container failure is recognized as a global event instead of an 
> app-specific issue, we should consider making the blacklist not permanent but 
> scoped to a specific time window.
> 3. We could have user-defined blacklist policies to address more possible 
> cases and scenarios, so it is reasonable to make the blacklist policy 
> pluggable.
> 4. For some test scenario, we

[jira] [Updated] (YARN-998) Persistent resource change during NM/RM restart

2016-03-18 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-998:

Attachment: YARN-998-v2.patch

Updated the patch to incorporate Jian's comments and fix the Jenkins whitespace 
and checkstyle issues.
The unit test failures (TestAMAuthorization and TestClientRMTokens) are not 
related; we have seen the same failures many times on trunk (latest in 
YARN-4785).

> Persistent resource change during NM/RM restart
> ---
>
> Key: YARN-998
> URL: https://issues.apache.org/jira/browse/YARN-998
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, scheduler
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-998-sample.patch, YARN-998-v1.patch, 
> YARN-998-v2.patch
>
>
> When the NM is restarted, whether planned or due to a failure, the previous 
> dynamic resource setting should be kept for consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4830) Add support for resource types in the nodemanager

2016-03-18 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4830:

Description: The RM has support for multiple resource types. The same 
should be added for the NMs.

> Add support for resource types in the nodemanager
> -
>
> Key: YARN-4830
> URL: https://issues.apache.org/jira/browse/YARN-4830
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>
> The RM has support for multiple resource types. The same should be added for 
> the NMs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4837) User facing aspects of 'AM blacklisting' feature need fixing

2016-03-18 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201655#comment-15201655
 ] 

Sangjin Lee commented on YARN-4837:
---

I just wanted to add my 2 cents to the discussion, specifically about YARN-4284 
where we broadened the cause for blacklisting a node for an AM purpose.

AMs repeatedly getting assigned to the same node in spite of failures is one of 
the most frequent complaints from our users ("why did our AMs keep landing on 
that bad node, causing our jobs to fail?"). If a node is having a "soft" 
failure that doesn't quite trip itself over to an unhealthy state, that's the 
worst possible case. Since the node is still healthy and appears to have a lot 
of available capacity, the chance that it still gets the next attempt is quite 
high; i.e. we have node-affinity. And since this is the AM, the consequence is 
much more severe than when a regular container lands on that node.

Oftentimes, the cause for this soft failure situation is varied, and trying to 
come up with a precise set of exit codes that meet this criteria isn't 
straightforward. There are even error codes like INVALID which we see quite 
often (see [my previous 
comment|https://issues.apache.org/jira/browse/YARN-4284?focusedCommentId=14966248&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14966248]).
 I know it could blacklist the node for the app for reasons such as the app's 
configuration error (false positives). However, the reason we could afford to 
go broad is that this blacklisting is *per-app*. The only downside is that the 
next attempt gets assigned to another node.

We have a number of large busy clusters, and we're using this with success and 
with little downside.

That said, I do recognize that this could be a problem if 
{{yarn.resourcemanager.am.max-attempts}} is larger than the size of the cluster.

> User facing aspects of 'AM blacklisting' feature need fixing
> 
>
> Key: YARN-4837
> URL: https://issues.apache.org/jira/browse/YARN-4837
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> Was reviewing the user-facing aspects that we are releasing as part of 2.8.0.
> Looking at the 'AM blacklisting feature', I see several things to be fixed 
> before we release it in 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4831) Recovered containers will be killed after NM stateful restart

2016-03-18 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li reassigned YARN-4831:
-

Assignee: Siqi Li

> Recovered containers will be killed after NM stateful restart 
> --
>
> Key: YARN-4831
> URL: https://issues.apache.org/jira/browse/YARN-4831
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Siqi Li
>Assignee: Siqi Li
> Attachments: YARN-4831.v1.patch
>
>
> {code}
> 2016-03-04 19:43:48,130 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1456335621285_0040_01_66 transitioned from NEW to 
> DONE
> 2016-03-04 19:43:48,130 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=henkins-service 
>OPERATION=Container Finished - Killed   TARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1456335621285_0040
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client

2016-03-18 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197858#comment-15197858
 ] 

Naganarasimha G R commented on YARN-4711:
-

Hi [~sjlee0], 
The following issues need to be fixed as part of this JIRA:
# Stop retrying on all exceptions
# Async calls for metrics need to be implemented
# Possible NPE in NMTimelinePublisher$ContainerEventHandler
# Possible NPE in putEntity(NMTimelinePublisher.java:213)
# Possible NPE in TimelineEntity.toString() when real is not null

I wanted to discuss issue 3 in particular. After YARN-3367, it is questionable 
whether we still need to queue in the Dispatcher when we are already queuing in 
the TimelineClient for NM-side events. On further thought, it will be required 
only for sync events; for example, 
{{ContainerManagerImpl.ContainerEventDispatcher.handle(ContainerEvent)}} 
should not be *blocked* on a sync timeline publish event. But one flaw in the 
current handling is that we put the event into *nmMetricsPublisher.dispatcher* 
in {{nmMetricsPublisher.publishContainerEvent}}; because of this, any delay in 
the dispatcher can cause us to miss publishing events (the NPE we got in point 
3).
So my idea was to create the ATS event to be published, wrap it in the existing 
*TimelinePublishEvent*, put it in the dispatcher, and for metrics directly get 
the app's TimelineClient and call *putEntitiesAsync*; a rough sketch follows. 
Thoughts?
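
A very rough sketch of that idea; only the putEntitiesAsync call comes from the 
comment above, while the surrounding class, the per-app client lookup, and the 
error handling are assumptions:

{code:title=MetricsPublisherSketch.java (hypothetical)}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

abstract class MetricsPublisherSketch {
  /** Assumed per-application lookup; the real publisher keeps such a mapping. */
  abstract TimelineClient getTimelineClient(ApplicationId appId);

  void reportContainerMetrics(ApplicationId appId, TimelineEntity entity) {
    TimelineClient client = getTimelineClient(appId);
    if (client == null) {
      return;  // app may already be finished; drop instead of risking an NPE
    }
    try {
      // Metrics skip the NM dispatcher and use the client's own async queue,
      // so a slow dispatcher cannot delay or drop them.
      client.putEntitiesAsync(entity);
    } catch (Exception e) {
      // Swallowed in this sketch; real code would log or retry as appropriate.
    }
  }
}
{code}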

> NM is going down with NPE's due to single thread processing of events by 
> Timeline client
> 
>
> Key: YARN-4711
> URL: https://issues.apache.org/jira/browse/YARN-4711
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>Priority: Critical
>  Labels: yarn-2928-1st-milestone
> Attachments: 4711Analysis.txt
>
>
> After YARN-3367, while testing the latest 2928 branch came across few NPEs 
> due to which NM is shutting down.
> {code}
> 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> On analysis we found that there was a delay in processing events because, 
> after YARN-3367, all the events were processed by a single thread inside the 
> timeline client.
> Additionally, we found one scenario where an NPE is possible:
> * TimelineEntity.toString() when {{real}} is not null



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-998) Persistent resource change during NM/RM restart

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197900#comment-15197900
 ] 

Hadoop QA commented on YARN-998:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
41s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 patch generated 2 new + 54 unchanged - 12 fixed = 56 total (was 66) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 30s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 16s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 163m 5s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestRMAdminService |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestRMAdminService |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | 

[jira] [Updated] (YARN-4743) ResourceManager crash because TimSort

2016-03-18 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated YARN-4743:
-
Attachment: YARN-4743-cdh5.4.7.patch

> ResourceManager crash because TimSort
> -
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Zephyr Guo
>Assignee: Yufei Gu
> Attachments: YARN-4743-cdh5.4.7.patch
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>  at java.util.TimSort.mergeHi(TimSort.java:868)
>  at java.util.TimSort.mergeAt(TimSort.java:485)
>  at java.util.TimSort.mergeCollapse(TimSort.java:410)
>  at java.util.TimSort.sort(TimSort.java:214)
>  at java.util.TimSort.sort(TimSort.java:173)
>  at java.util.Arrays.sort(Arrays.java:659)
>  at java.util.Collections.sort(Collections.java:217)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>  at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resource}} while we are sorting 
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator comparator = policy.getComparator();
> writeLock.lock();
> try {
>   Collections.sort(runnableApps, comparator);
> } finally {
>   writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ..
>   s1.getResourceUsage(), minShare1);
>   boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>   s2.getResourceUsage(), minShare2);
>   minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare1, 
> ONE).getMemory();
>   minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>   / Resources.max(RESOURCE_CALCULATOR, null, minShare2, 
> ONE).getMemory();
> ..
> {code}
> {{getResourceUsage}} will return the current Resource, and the current 
> Resource is unstable.
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
> // Here the getPreemptedResources() always return zero, except in
> // a preemption round
> return Resources.subtract(getCurrentConsumption(), 
> getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
> return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ..
> Resources.addTo(currentConsumption, rmContainer.getContainer()
>   .getResource());
> ..
>   }
> {code}
> I suggest using a stable Resource snapshot in the comparator.
> Is there something wrong with my reasoning?
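
A generic illustration of that suggestion (not the actual 
FSLeafQueue/FairShareComparator code): snapshot each item's mutable usage once, 
then sort against the snapshot so the comparator's ordering cannot change 
mid-sort, which is what violates TimSort's contract.

{code:title=StableSort.java (illustrative sketch)}
import java.util.Collections;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class StableSort {
  /** Stand-in for Schedulable: anything exposing a mutable usage value. */
  interface UsageSource {
    long currentUsage();
  }

  static <T extends UsageSource> void sortByStableUsage(List<T> items) {
    // Read each mutable value exactly once, before sorting starts.
    final Map<T, Long> snapshot = new IdentityHashMap<T, Long>();
    for (T item : items) {
      snapshot.put(item, item.currentUsage());
    }
    // The comparator only sees the immutable snapshot, so its answers stay
    // consistent for the whole duration of the sort.
    Collections.sort(items, new Comparator<T>() {
      @Override
      public int compare(T a, T b) {
        return snapshot.get(a).compareTo(snapshot.get(b));
      }
    });
  }
}
{code}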



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199604#comment-15199604
 ] 

Hadoop QA commented on YARN-4062:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 49s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
14s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
40s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 29s 
{color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
46s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
19s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 21s 
{color} | {color:green} YARN-2928 passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 39s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
9s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 55s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 55s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 20s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 2 new + 
212 unchanged - 1 fixed = 214 total (was 213) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 11s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 31s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_74. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 4m 5s {color} | 
{color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:

[jira] [Commented] (YARN-3926) Extend the YARN resource model for easier resource-type management and profiles

2016-03-18 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197763#comment-15197763
 ] 

Varun Vasudev commented on YARN-3926:
-

bq. That should work... But I feel maybe allow-mismatch should be the default: 
if the NM has a super-set of the RM's resource types, the extras will just be 
ignored; if a sub-set, then for those specific resource-types the RM will 
assign a 0 value for the NM.

I don't have any particular preference - I can see scenarios for all 3. I'm 
fine with making allow mismatch the default.
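
For illustration only (hypothetical names, not actual YARN code), the 
allow-mismatch default described above boils down to something like:

{code:title=ResourceTypeReconciler.java (hypothetical)}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class ResourceTypeReconciler {
  /** RM-side view of an NM's resource report under the allow-mismatch default. */
  static Map<String, Long> reconcile(Set<String> rmTypes, Map<String, Long> nmReported) {
    Map<String, Long> result = new HashMap<String, Long>();
    for (String type : rmTypes) {
      Long value = nmReported.get(type);  // null if the NM did not report this type
      result.put(type, value != null ? value : 0L);
    }
    // Anything the NM reports beyond rmTypes is silently ignored (super-set case).
    return result;
  }
}
{code}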

bq. We can also add admin API on the RM to add / remove allowable resource 
types on the fly.

This should be doable, but we need to think through the effect on running apps.

> Extend the YARN resource model for easier resource-type management and 
> profiles
> ---
>
> Key: YARN-3926
> URL: https://issues.apache.org/jira/browse/YARN-3926
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: Proposal for modifying resource model and profiles.pdf
>
>
> Currently, there are efforts to add support for various resource-types such 
> as disk(YARN-2139), network(YARN-2140), and  HDFS bandwidth(YARN-2681). These 
> efforts all aim to add support for a new resource type and are fairly 
> involved efforts. In addition, once support is added, it becomes harder for 
> users to specify the resources they need. All existing jobs have to be 
> modified, or have to use the minimum allocation.
> This ticket is a proposal to extend the YARN resource model to a more 
> flexible model which makes it easier to support additional resource-types. It 
> also considers the related aspect of “resource profiles” which allow users to 
> easily specify the various resources they need for any given container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4576) Enhancement for tracking Blacklist in AM Launching

2016-03-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201559#comment-15201559
 ] 

Junping Du commented on YARN-4576:
--

bq.   this whole "AM blacklisting" feature is unnecessarily blown way out of 
proportion - we just don't need this amount of complexity.
Are we not seeing that the failure cost for an AM container can be different 
from that of normal containers? If so, a separate blacklist for AM container 
launching versus normal container launching makes a lot of sense to me. 
Actually, we could also call it a grey list - a special status between good and 
bad: risky for launching an AM container but OK for launching normal 
containers. I would buy into this complexity to address more scenarios - as 
long as they are solid.

bq. Containers are marked DISKS_FAILED only if all the disks have become bad, 
in which case the node itself becomes unhealthy. So there is no need for 
blacklisting per app at all !!
No. The DISKS_FAILED mark on bad disks is a transient status for the node. Take 
an example: if 
"yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" 
is set to 90% (by default) and another job (YARN or not) writes some files and 
deletes them afterwards back and forth, and the node's disk usage happens to 
hover around 90%, then the NM's health status reported to the RM flips between 
healthy and unhealthy back and forth. A blacklist for AM launching can evaluate 
the history to decide on a better place to launch the AM. The bar for launching 
normal containers could be different, or we could end up with too few choices.

bq. If an AM is killed due to memory over-flow, blacklisting the node will not 
help at all!
I agree that if the memory overflow issue is due to the AM asking for more 
resources than expected, then scheduling it on other nodes makes no difference. 
However, some memory issues are caused by memory congestion on the node, in 
which case the AM can do fine on other nodes that have different memory 
resources and settings.

bq. When YARN finds a node with configuration / permission issues, it should 
itself take an action to (a) avoid scheduling on that node, (b) alert 
administrators etc.
If YARN knows it is a configuration/permission issue, we definitely should do 
a) and b). But sometimes the runtime container failure is intermittent (e.g. 
due to disk, memory resources, etc.), and it is nice for the bar for launching 
AM containers to be different from that for normal containers.

bq. But that isn't the case with YARN - part of the reason why we never 
implemented heuristics based per-app blacklisting in YARN - we left that 
completely up to applications.
We don't really give applications that freedom - where and how to launch an 
application's AM container has never been the application's business so far; 
that's why we call it out here - give applications the right to set their bar 
for AM launching.

> Enhancement for tracking Blacklist in AM Launching
> --
>
> Key: YARN-4576
> URL: https://issues.apache.org/jira/browse/YARN-4576
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: EnhancementAMLaunchingBlacklist.pdf
>
>
> Before YARN-2005, the YARN blacklist mechanism tracked bad nodes in the AM: 
> if the AM failed to launch containers on a specific node several times, the 
> AM would blacklist that node in future resource requests. This mechanism 
> works fine for normal containers. However, from our observation of several 
> clusters: if a problematic node fails to launch the AM, the RM can pick the 
> same problematic node to launch the next AM attempts again and again, which 
> causes application failure when other functional nodes are busy. In the 
> normal case, a customized health checker script cannot be so sensitive as to 
> mark a node unhealthy when only one or two container launches fail.
> After YARN-2005, we can have a BlacklistManager in each RMApp, so nodes that 
> previously failed to launch AM attempts for a specific application get 
> blacklisted. To avoid the risk of all nodes being blacklisted by the 
> BlacklistManager, a disable-failure-threshold is used to stop adding more 
> nodes to the blacklist once a certain ratio is hit.
> There are already some enhancements to this AM blacklist mechanism: 
> YARN-4284 addresses the wider case of AM container launch failures, and 
> YARN-4389 makes the configuration settings changeable per app to meet 
> app-specific requirements. However, there are still several gaps to address 
> more scenarios:
> 1. We may need a global blacklist instead of each app maintaining a separate 
> one. The reason is: AM could get more chance to fail if other AM get fa

[jira] [Commented] (YARN-4636) Make blacklist tracking policy pluggable for more extensions.

2016-03-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201373#comment-15201373
 ] 

Junping Du commented on YARN-4636:
--

bq. -1 for something like this without understanding the use-cases.
We should ask for use cases first before making a -1 decision.

bq. IMO, the "AM blacklisting" doesn't even need to be user-visible (YARN-4837) 
let alone be pluggable.
A pluggable blacklist policy is necessary because applications' requirements 
for AM robustness differ. Some apps can tolerate AM failure (small and 
short-running jobs), but some apps don't want any risk (like a large MR job 
with long-running reducer tasks - an AM restart will kill the reducer tasks no 
matter how long they have already been running). IMO, allowing various 
blacklist policies is a good way for YARN to show its extension capability and 
address different applications' requirements, especially for a cluster formed 
of heterogeneous nodes. Any comments from folks on the watch list?

> Make blacklist tracking policy pluggable for more extensions.
> -
>
> Key: YARN-4636
> URL: https://issues.apache.org/jira/browse/YARN-4636
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Junping Du
>Assignee: Sunil G
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4837) User facing aspects of 'AM blacklisting' feature need fixing

2016-03-18 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202323#comment-15202323
 ] 

Sangjin Lee commented on YARN-4837:
---

Yes, [~vinodkv], I agree with the direction you laid out.

> User facing aspects of 'AM blacklisting' feature need fixing
> 
>
> Key: YARN-4837
> URL: https://issues.apache.org/jira/browse/YARN-4837
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> Was reviewing the user-facing aspects that we are releasing as part of 2.8.0.
> Looking at the 'AM blacklisting feature', I see several things to be fixed 
> before we release it in 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started

2016-03-18 Thread Eric Badger (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199736#comment-15199736
 ] 

Eric Badger commented on YARN-4686:
---

[~kasha] When HA is not enabled, the RM will be transitioned to active in the 
RM serviceStart method. So it is not necessary to transition it to active in 
the MiniYARNCluster serviceStart method. However, the transitionToActive method 
will not end up actually transitioning the RM again since it checks to make 
sure that it is not already active before proceeding. I believe that this makes 
the MiniYARNCluster start method a little bit cleaner since there are no extra 
checks. But, it may be more intuitive for the check to be in the 
MiniYARNCluster start so that it is obvious that the RM is only explicitly 
transitioned to active in HA setups. The check is made either way; it's just a 
matter of where we want it to occur. Currently it is in the RM start 
method, but if you feel that it is better to put it in the MiniYARNCluster 
start method then we can add the check there. 

> MiniYARNCluster.start() returns before cluster is completely started
> 
>
> Key: YARN-4686
> URL: https://issues.apache.org/jira/browse/YARN-4686
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Eric Badger
> Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch, 
> YARN-4686.002.patch, YARN-4686.003.patch, YARN-4686.004.patch, 
> YARN-4686.005.patch, YARN-4686.006.patch
>
>
> TestRMNMInfo fails intermittently. Below is trace for the failure
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo)  Time elapsed: 0.28 
> sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but 
> was:<3>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at 
> org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-03-18 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197826#comment-15197826
 ] 

Kuhu Shukla commented on YARN-4311:
---

Requesting Jason Lowe, Daniel Templeton for review/comments. Thanks a lot!

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch, YARN-4311-v10.patch, 
> YARN-4311-v11.patch, YARN-4311-v11.patch, YARN-4311-v2.patch, 
> YARN-4311-v3.patch, YARN-4311-v4.patch, YARN-4311-v5.patch, 
> YARN-4311-v6.patch, YARN-4311-v7.patch, YARN-4311-v8.patch, YARN-4311-v9.patch
>
>
> In order to fully forget about a node, removing it from the include and 
> exclude lists is not sufficient; the RM still lists it under decommissioned 
> nodes. The tricky part that [~jlowe] pointed out is the case when include 
> lists are not used: in that case we don't want the nodes to fall off if they 
> are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4746) yarn web services should convert parse failures of appId to 400

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200770#comment-15200770
 ] 

Hadoop QA commented on YARN-4746:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 7s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 43s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
19s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 58s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 58s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 39s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
36s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 10s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 12s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 50s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 10s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_95. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 26s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 8s {color} 
| {color:red} hadoop-yarn-serve

[jira] [Commented] (YARN-4699) Scheduler UI and REST o/p is not in sync when -replaceLabelsOnNode is used to change label of a node

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200371#comment-15200371
 ] 

Hadoop QA commented on YARN-4699:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 8s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 3s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 53s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 153m 21s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12789619/0001-YARN-4699.pat

[jira] [Updated] (YARN-4390) Consider container request size during CS preemption

2016-03-18 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4390:
-
Issue Type: Sub-task  (was: Bug)
Parent: YARN-45

> Consider container request size during CS preemption
> 
>
> Key: YARN-4390
> URL: https://issues.apache.org/jira/browse/YARN-4390
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.0.0, 2.8.0, 2.7.3
>Reporter: Eric Payne
>Assignee: Wangda Tan
>
> There are multiple reasons why preemption could unnecessarily preempt 
> containers. One is that an app could be requesting a large container (say 
> 8 GB), and the preemption monitor could conceivably preempt multiple smaller 
> containers (say eight 1-GB containers) in order to fill the large container 
> request. These smaller containers would then be rejected by the requesting AM 
> and potentially given right back to the preempted app.
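
To make the mismatch concrete, here is a minimal, purely illustrative Java 
sketch of request-size-aware victim selection: victims are only chosen on a 
node where the reclaimed memory could actually back the single large container. 
The Victim type, the method names, and the MB-based bookkeeping are 
hypothetical, not the preemption monitor's real code.

{code}
import java.util.*;

// Sketch only: pick preemption victims on a node where the reclaimed memory
// can actually back the single large container that was requested.
public class RequestAwarePreemption {

  record Victim(String node, int memMb) {}

  // Returns the victims to preempt, or an empty list if no single node can
  // satisfy the request even after preemption (avoiding useless kills).
  static List<Victim> selectVictims(int requestedMb,
                                    Map<String, Integer> freeMbPerNode,
                                    Map<String, List<Victim>> candidatesPerNode) {
    for (Map.Entry<String, List<Victim>> e : candidatesPerNode.entrySet()) {
      int reclaimable = freeMbPerNode.getOrDefault(e.getKey(), 0);
      List<Victim> chosen = new ArrayList<>();
      for (Victim v : e.getValue()) {
        if (reclaimable >= requestedMb) {
          break;
        }
        chosen.add(v);
        reclaimable += v.memMb();
      }
      if (reclaimable >= requestedMb) {
        return chosen;          // enough room on one node for the 8 GB ask
      }
    }
    return List.of();           // scattered small preemptions would not help
  }

  public static void main(String[] args) {
    Map<String, Integer> free = Map.of("n1", 2048, "n2", 0);
    Map<String, List<Victim>> candidates = Map.of(
        "n1", List.of(new Victim("n1", 4096), new Victim("n1", 2048)),
        "n2", List.of(new Victim("n2", 1024)));
    System.out.println(selectVictims(8192, free, candidates));
  }
}
{code}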



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-4829) Add support for binary units

2016-03-18 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197449#comment-15197449
 ] 

Varun Vasudev edited comment on YARN-4829 at 3/16/16 3:12 PM:
--

Attached a file with the fix. [~asuresh] - do you mind doing a review? The 
patch is for the YARN-3926 branch.


was (Author: vvasudev):
Attached a file with the fix.

> Add support for binary units
> 
>
> Key: YARN-4829
> URL: https://issues.apache.org/jira/browse/YARN-4829
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4829-YARN-3926.001.patch
>
>
> The units conversion util should have support for binary units.
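
For illustration only, a minimal sketch of what binary-unit support could look 
like. This is not the actual patch and does not use the real 
UnitsConversionUtil API; it is just a small standalone converter that 
understands Ki/Mi/Gi next to k/M/G.

{code}
import java.util.Map;

// Sketch only: convert a value with a binary (Ki/Mi/Gi) or decimal (k/M/G)
// unit suffix down to the base unit. Not the real UnitsConversionUtil.
public class BinaryUnits {
  private static final Map<String, Long> FACTORS = Map.of(
      "", 1L,
      "k", 1_000L, "M", 1_000_000L, "G", 1_000_000_000L,
      "Ki", 1_024L, "Mi", 1_024L * 1_024, "Gi", 1_024L * 1_024 * 1_024);

  public static long toBaseUnit(long value, String unit) {
    Long factor = FACTORS.get(unit);
    if (factor == null) {
      throw new IllegalArgumentException("Unknown unit: " + unit);
    }
    return Math.multiplyExact(value, factor);   // fail fast on overflow
  }

  public static void main(String[] args) {
    System.out.println(toBaseUnit(4, "Gi"));    // 4294967296
    System.out.println(toBaseUnit(4, "G"));     // 4000000000
  }
}
{code}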



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4833) For Queue AccessControlException client retries multiple times on both RM

2016-03-18 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202550#comment-15202550
 ] 

Sunil G commented on YARN-4833:
---

Hi [~bibinchundatt] and [~jianhe]

I have some doubts here. {{AccessControlException}} is used in various places, 
and in many of them it is caught by the caller as an IOException. Maybe I didn't 
fully understand point 2 of the proposed solution, "Wrap AccessControlException 
in YarnException". Do you mean that instead of throwing AccessControlException, 
it would be thrown as a {{YarnException}} carrying the message from the 
AccessControlException, or are you planning to make AccessControlException a 
YarnException in its inheritance hierarchy?

Since {{submitApplication}} is user facing, I am not very sure about changing to 
YarnException in these ACL-specific cases.

The RPC RemoteException is handled with the correct RetryPolicy, so would it be 
possible to throw the AccessControlException wrapped in an RPC RemoteException? 
Maybe I missed or overlooked something; please correct me if I am wrong.
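
To make the two options in the question concrete, the following is a hedged, 
standalone sketch with stand-in exception classes (simplified signatures, not 
the real ClientRMService or org.apache.hadoop.yarn.exceptions code): option 1 
wraps the ACL failure in a YarnException, option 2 lets the 
AccessControlException propagate and surface through the RPC layer.

{code}
// Illustrative stand-ins only; the real classes live in
// org.apache.hadoop.yarn.exceptions and org.apache.hadoop.security.
class YarnException extends Exception {
  YarnException(String msg, Throwable cause) { super(msg, cause); }
}
class AccessControlException extends java.io.IOException {
  AccessControlException(String msg) { super(msg); }
}

public class AclFailureWrapping {
  // Option 1: wrap the ACL failure in a YarnException so clients see an
  // application-level error instead of a retried IO-level one.
  static void submitWrapped(String user, String queue) throws YarnException {
    try {
      checkAcl(user, queue);
    } catch (AccessControlException ace) {
      throw new YarnException("User " + user + " cannot submit to " + queue, ace);
    }
  }

  // Option 2: let the AccessControlException propagate as-is; over RPC it then
  // arrives wrapped in a RemoteException and the client's RetryPolicy decides.
  static void submitUnwrapped(String user, String queue) throws AccessControlException {
    checkAcl(user, queue);
  }

  private static void checkAcl(String user, String queue) throws AccessControlException {
    throw new AccessControlException(
        "User " + user + " does not have permission to submit to queue " + queue);
  }

  public static void main(String[] args) {
    try {
      submitWrapped("hdfs", "default");
    } catch (YarnException e) {
      System.out.println(e.getMessage() + " (cause: " + e.getCause().getMessage() + ")");
    }
  }
}
{code}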

> For Queue AccessControlException client retries multiple times on both RM
> -
>
> Key: YARN-4833
> URL: https://issues.apache.org/jira/browse/YARN-4833
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Submit an application to a queue where ACLs are enabled and the submitting 
> user does not have access. The client retries until failMaxattempt (10 times).
> {noformat}
> 16/03/18 10:01:06 INFO retry.RetryInvocationHandler: Exception while invoking 
> submitApplication of class ApplicationClientProtocolPBClientImpl over rm1. 
> Trying to fail over immediately.
> org.apache.hadoop.security.AccessControlException: User hdfs does not have 
> permission to submit application_1458273884145_0001 to queue default
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:380)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:291)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2360)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2356)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2356)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:257)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
> at com.sun.proxy.$Proxy23.submitApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:261)
> at 
> org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDe

[jira] [Updated] (YARN-4694) Document ATS v1.5

2016-03-18 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-4694:

Assignee: Li Lu  (was: Xuan Gong)

> Document ATS v1.5
> -
>
> Key: YARN-4694
> URL: https://issues.apache.org/jira/browse/YARN-4694
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Li Lu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4815) ATS 1.5 timelineclient impl try to create attempt directory for every event call

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202365#comment-15202365
 ] 

Hadoop QA commented on YARN-4815:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 22s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 18s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
48s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 44s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
38s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
26s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 34s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
54s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 45s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 45s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 26s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 3 new + 
215 unchanged - 0 fixed = 218 total (was 215) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 59s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
56s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 18s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 31s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 31s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_74. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 25s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 28s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 22s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {c

[jira] [Commented] (YARN-4837) User facing aspects of 'AM blacklisting' feature need fixing

2016-03-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202500#comment-15202500
 ] 

Junping Du commented on YARN-4837:
--

Hi [~vinodkv], did you see my comments at YARN-4576 
(https://issues.apache.org/jira/browse/YARN-4576?focusedCommentId=15201559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15201559)?
+1 on "explicitly treat known exit-codes" which is exactly the same as previous 
proposal in YARN-4576. However, the different is:
"DISKS_FAILED" shouldn't be skipped for the reason I mentioned in YARN-4576. 
Also, we cannot simply judge system innocent when hitting memory issues.
Also, hide all AM scheduling info/preference from application doesn't make 
sense in long time: AM can ask for resources for its running containers in the 
beginning, but application cannot ask how to place its AM even today which is 
sad to me.
YARN-4685 is something fixable and much better than the age without blacklist 
(we do see AM keep launching on bad nodes repeatedly and get stuck in many 
cases). We just need to go ahead to fix YARN-4685.

> User facing aspects of 'AM blacklisting' feature need fixing
> 
>
> Key: YARN-4837
> URL: https://issues.apache.org/jira/browse/YARN-4837
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> Was reviewing the user-facing aspects that we are releasing as part of 2.8.0.
> Looking at the 'AM blacklisting feature', I see several things to be fixed 
> before we release it in 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation

2016-03-18 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202391#comment-15202391
 ] 

Bikas Saha commented on YARN-1040:
--

This design doc effectively looks like a redesign of almost all of YARN's core 
semantics. It probably deserves a wider discussion on the dev mailing list and 
under its own jira. Although it covers YARN-1040 and YARN-4726, the scope looks 
much wider, and careful thinking about backwards compatibility is needed. 
Conceptually it changes the current semantic understanding of allocation and 
container that is widely understood externally. I am afraid that this jira, or 
just the folks on this thread, is not enough to make a decision on the given 
proposal.

As far as this jira is concerned, both the previous (say a) and new (say b) 
proposals sound similar, with startContainer_in_a renamed to 
startAllocation_in_b and startProcess_in_a renamed to startContainer_in_b. So we 
may be fine in that restricted part, minus the renamings.

> De-link container life cycle from an Allocation
> ---
>
> Key: YARN-1040
> URL: https://issues.apache.org/jira/browse/YARN-1040
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
> Attachments: YARN-1040-rough-design.pdf
>
>
> The AM should be able to exec >1 process in a container, rather than have the 
> NM automatically release the container when the single process exits.
> This would let an AM restart a process on the same container repeatedly, 
> which for HBase would offer locality on a restarted region server.
> We may also want the ability to exec multiple processes in parallel, so that 
> something could be run in the container while a long-lived process was 
> already running. This can be useful in monitoring and reconfiguring the 
> long-lived process, as well as shutting it down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU

2016-03-18 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202256#comment-15202256
 ] 

Karthik Kambatla commented on YARN-810:
---

Hasn't this been recently added with strict cpu usage? 

> Support CGroup ceiling enforcement on CPU
> -
>
> Key: YARN-810
> URL: https://issues.apache.org/jira/browse/YARN-810
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.1.0-beta, 2.0.5-alpha
>Reporter: Chris Riccomini
>Assignee: Wei Yan
>  Labels: BB2015-05-TBR
> Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810-5.patch, 
> YARN-810-6.patch, YARN-810.patch, YARN-810.patch
>
>
> Problem statement:
> YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
> Containers are then allowed to request vcores between the minimum and maximum 
> defined in the yarn-site.xml.
> In the case where a single-threaded container requests 1 vcore, with a 
> pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
> the core it's using, provided that no other container is also using it. This 
> happens, even though the only guarantee that YARN/CGroups is making is that 
> the container will get "at least" 1/4th of the core.
> If a second container then comes along, the second container can take 
> resources from the first, provided that the first container is still getting 
> at least its fair share (1/4th).
> There are certain cases where this is desirable. There are also certain cases 
> where it might be desirable to have a hard limit on CPU usage, and not allow 
> the process to go above the specified resource requirement, even if it's 
> available.
> Here's an RFC that describes the problem in more detail:
> http://lwn.net/Articles/336127/
> Solution:
> As it happens, when CFS is used in combination with CGroups, you can enforce 
> a ceiling using two files in cgroups:
> {noformat}
> cpu.cfs_quota_us
> cpu.cfs_period_us
> {noformat}
> The usage of these two files is documented in more detail here:
> https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
> Testing:
> I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, 
> it behaves as described above (it is a soft cap, and allows containers to use 
> more than they asked for). I then tested CFS CPU quotas manually with YARN.
> First, you can see that CFS is in use in the CGroup, based on the file names:
> {noformat}
> [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
> total 0
> -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
> drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
> -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
> -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
> -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
> -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
> -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
> -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
> -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
> -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
> [criccomi@eat1-qa464 ~]$ sudo -u app cat
> /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
> 10
> [criccomi@eat1-qa464 ~]$ sudo -u app cat
> /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
> -1
> {noformat}
> Oddly, it appears that the cfs_period_us is set to .1s, not 1s.
> We can place processes in hard limits. I have process 4370 running YARN 
> container container_1371141151815_0003_01_03 on a host. By default, it's 
> running at ~300% cpu usage.
> {noformat}
> CPU
> 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
> {noformat}
> When I set the CFS quota:
> {noformat}
> echo 1000 > 
> /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
>  CPU
> 4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 ...
> {noformat}
> It drops to 1% usage, and you can see the box has room to spare:
> {noformat}
> Cpu(s):  2.4%us,  1.0%sy,  0.0%ni, 92.2%id,  4.2%wa,  0.0%hi,  0.1%si, 
> 0.0%st
> {noformat}
> Turning the quota back to -1:
> {noformat}
> echo -1 > 
> /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
> {noformat}
> Burns the cores again:
> {noformat}
> Cpu(s): 11.1%us,  1.7%sy,  0.0%ni, 83.9%id,  3.1%wa,  0.0%hi,  0.2%si, 
> 0.0%st
> CPU
> 4370 criccomi  20   0 1157m 563m  14m S 253.9  0.8  89:32.31 ...
> {noformat}
> On my dev box, I was testing CGroups by running a python process eight times, 
> to burn t
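
As a small illustration of the ceiling-enforcement idea described above (a 
sketch under assumptions, not the NodeManager's actual CGroups handler), the 
quota value for a hard cap can be derived from the container's vcore ask and 
echoed into cpu.cfs_quota_us. The cgroup path and the quota formula below are 
assumptions.

{code}
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: derive the cfs_quota_us value that hard-caps a container at its
// requested number of vcores, and show where it would be written.
public class CfsCeiling {
  // With quota = period * vcores, the group may consume at most `vcores`
  // full cores' worth of CPU time per scheduling period.
  static long quotaMicros(long periodMicros, int containerVcores) {
    return periodMicros * containerVcores;
  }

  public static void main(String[] args) {
    Path containerCgroup =
        Paths.get("/cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03");
    long period = 100_000L;   // 0.1s, matching the period observed above
    int vcores = 1;           // the container asked for a single vcore
    System.out.println("echo " + quotaMicros(period, vcores) + " > "
        + containerCgroup.resolve("cpu.cfs_quota_us"));
    // Writing -1 back to cpu.cfs_quota_us removes the ceiling again.
  }
}
{code}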

[jira] [Created] (YARN-4840) Add option to upload files recursively from container directory

2016-03-18 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-4840:


 Summary: Add option to upload files recursively from container 
directory
 Key: YARN-4840
 URL: https://issues.apache.org/jira/browse/YARN-4840
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 2.8.0
Reporter: Brook Zhou
Priority: Minor
 Fix For: 2.8.0


It may be useful to allow users to aggregate their logs recursively from 
container directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4595) Add support for configurable read-only mounts

2016-03-18 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200076#comment-15200076
 ] 

Allen Wittenauer commented on YARN-4595:


What's preventing users from mounting files and file systems they shouldn't 
have access to?

> Add support for configurable read-only mounts
> -
>
> Key: YARN-4595
> URL: https://issues.apache.org/jira/browse/YARN-4595
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
> Attachments: YARN-4595.1.patch, YARN-4595.2.patch
>
>
> Mounting files or directories from the host is one way of passing 
> configuration and other information into a docker container.  We could allow 
> the user to set a list of mounts in the environment of ContainerLaunchContext 
> (e.g. /dir1:/targetdir1,/dir2:/targetdir2).  These would be mounted read-only 
> to the specified target locations.
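
For illustration, a hedged sketch of parsing the proposed mount-list format. 
The Mount record, the method names, and the rendering as docker 
{{-v source:target:ro}} arguments are assumptions made for the example, not 
necessarily what the patch does.

{code}
import java.util.*;

// Sketch: parse "source:target" pairs from a comma-separated mount list and
// render them as read-only bind-mount arguments.
public class ReadOnlyMounts {
  record Mount(String source, String target) {}

  static List<Mount> parse(String spec) {
    List<Mount> mounts = new ArrayList<>();
    if (spec == null || spec.isBlank()) {
      return mounts;
    }
    for (String entry : spec.split(",")) {
      String[] parts = entry.split(":");
      if (parts.length != 2 || parts[0].isBlank() || parts[1].isBlank()) {
        throw new IllegalArgumentException("Bad mount entry: " + entry);
      }
      mounts.add(new Mount(parts[0].trim(), parts[1].trim()));
    }
    return mounts;
  }

  public static void main(String[] args) {
    for (Mount m : parse("/dir1:/targetdir1,/dir2:/targetdir2")) {
      System.out.println("-v " + m.source() + ":" + m.target() + ":ro");
    }
  }
}
{code}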



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4823) Refactor the nested reservation id field in listReservation to simple string field

2016-03-18 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-4823:
-
Description: The listReservation REST API returns a ReservationId field 
which has a nested id field which is also called ReservationId. This JIRA 
proposes to rename the nested field to a string as it's easier to read and 
moreover what the update/delete APIs take in as input.  (was: The 
listReservation REST API returns a ReservationId field which has a nested id 
field which is also called ReservationId. This JIRA proposes to rename the 
nested field to id.)

> Refactor the nested reservation id field in listReservation to simple string 
> field
> --
>
> Key: YARN-4823
> URL: https://issues.apache.org/jira/browse/YARN-4823
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>
> The listReservation REST API returns a ReservationId field which has a nested 
> id field that is also called ReservationId. This JIRA proposes to replace the 
> nested field with a simple string, as that is easier to read and is, moreover, 
> what the update/delete APIs take as input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4785) inconsistent value type of the "type" field for LeafQueueInfo in response of RM REST API - cluster/scheduler

2016-03-18 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4785:

Attachment: YARN-4785.branch-2.6.001.patch
YARN-4785.branch-2.7.001.patch

Thanks for the reviews [~djp] and [~jhsenjaliya]! Junping - I've attached 
versions of the patch for 2.6 and 2.7. Can you please commit them?

> inconsistent value type of the "type" field for LeafQueueInfo in response of 
> RM REST API - cluster/scheduler
> 
>
> Key: YARN-4785
> URL: https://issues.apache.org/jira/browse/YARN-4785
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.6.0
>Reporter: Jayesh
>Assignee: Varun Vasudev
>  Labels: REST_API
> Attachments: YARN-4785.001.patch, YARN-4785.branch-2.6.001.patch, 
> YARN-4785.branch-2.7.001.patch
>
>
> I see an inconsistent value type (String vs. Array) for the "type" field of 
> LeafQueueInfo in the response of the RM REST API - cluster/scheduler.
> As per the spec it should always be a String.
> Here is the sample output (non-relevant fields removed):
> {code}
> {
>   "scheduler": {
> "schedulerInfo": {
>   "type": "capacityScheduler",
>   "capacity": 100,
>   ...
>   "queueName": "root",
>   "queues": {
> "queue": [
>   {
> "type": "capacitySchedulerLeafQueueInfo",
> "capacity": 0.1,
> 
>   },
>   {
> "type": [
>   "capacitySchedulerLeafQueueInfo"
> ],
> "capacity": 0.1,
> "queueName": "test-queue",
> "state": "RUNNING",
> 
>   },
>   {
> "type": [
>   "capacitySchedulerLeafQueueInfo"
> ],
> "capacity": 2.5,
> 
>   },
>   {
> "capacity": 25,
> 
> "state": "RUNNING",
> "queues": {
>   "queue": [
> {
>   "capacity": 6,
>   "state": "RUNNING",
>   "queues": {
> "queue": [
>   {
> "type": "capacitySchedulerLeafQueueInfo",
> "capacity": 100,
> ...
>   }
> ]
>   },
>   
> },
> {
>   "capacity": 6,
>   ...
>   "state": "RUNNING",
>   "queues": {
> "queue": [
>   {
> "type": "capacitySchedulerLeafQueueInfo",
> "capacity": 100,
> ...
>   }
> ]
>   },
>   ...
> },
> ...
>   ]
> },
> ...
>   }
> ]
>   }
> }
>   }
> }
> {code}
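
A possible client-side workaround, shown only as a hedged sketch (not part of 
the attached patches): read the "type" field tolerantly with Jackson so both 
the String form and the single-element Array form seen above are accepted.

{code}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch: tolerate the "type" field arriving either as a String or as a
// single-element array, as in the sample output above.
public class QueueTypeReader {
  static String readType(JsonNode queue) {
    JsonNode type = queue.get("type");
    if (type == null) {
      return null;                       // parent queues may omit "type"
    }
    return type.isArray() ? type.get(0).asText() : type.asText();
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    System.out.println(readType(mapper.readTree(
        "{\"type\": [\"capacitySchedulerLeafQueueInfo\"], \"capacity\": 0.1}")));
    System.out.println(readType(mapper.readTree(
        "{\"type\": \"capacitySchedulerLeafQueueInfo\", \"capacity\": 0.1}")));
  }
}
{code}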



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928

2016-03-18 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201695#comment-15201695
 ] 

Sangjin Lee commented on YARN-4712:
---

For the record (and following up on yesterday's offline discussion), I am +1 on 
the patch. If there are different opinions on this, we can reopen the 
discussion. Thanks! [~varun_saxena], will you do the honors of committing this 
patch?

> CPU Usage Metric is not captured properly in YARN-2928
> --
>
> Key: YARN-4712
> URL: https://issues.apache.org/jira/browse/YARN-4712
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch, 
> YARN-4712-YARN-2928.v1.004.patch, YARN-4712-YARN-2928.v1.005.patch, 
> YARN-4712-YARN-2928.v1.006.patch
>
>
> There are 2 issues with CPU usage collection:
> * I was able to observe that many times the CPU usage obtained from 
> {{pTree.getCpuUsagePercent()}} is 
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor 
> still does the calculation {{cpuUsageTotalCoresPercentage = 
> cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()}}, 
> because of which the UNAVAILABLE check in 
> {{NMTimelinePublisher.reportContainerResourceUsage}} is never triggered. 
> Proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainersMonitor publishes decimal values for the CPU usage.
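
To make the first point concrete, here is a minimal hedged sketch (simplified 
standalone code, not the actual ContainersMonitor or NMTimelinePublisher 
classes) of guarding the UNAVAILABLE (-1) sentinel before doing the division, 
and of rounding to a long before publishing, since the metric column expects 
long values.

{code}
// Sketch only: guard the -1 "unavailable" sentinel before any arithmetic,
// and round to a long because the metric column stores long values.
public class CpuUsageGuard {
  static final float UNAVAILABLE = -1.0f;   // stand-in for the -1 sentinel above

  // Returns the total-cores CPU percentage, or UNAVAILABLE if the raw sample
  // was unavailable, so callers can skip publishing instead of emitting -0.125.
  static float cpuUsageTotalCores(float cpuUsagePercentPerCore, int numProcessors) {
    if (cpuUsagePercentPerCore == UNAVAILABLE || numProcessors <= 0) {
      return UNAVAILABLE;
    }
    return cpuUsagePercentPerCore / numProcessors;
  }

  static void publish(float usage) {
    if (usage == UNAVAILABLE) {
      System.out.println("skip publishing: usage unavailable");
    } else {
      long metricValue = Math.round(usage);  // long, to suit a long-typed column
      System.out.println("publish " + metricValue);
    }
  }

  public static void main(String[] args) {
    publish(cpuUsageTotalCores(UNAVAILABLE, 8));  // skipped
    publish(cpuUsageTotalCores(240.0f, 8));       // published as 30
  }
}
{code}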



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4833) For Queue AccessControlException client retries multiple times on both RM

2016-03-18 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4833:
--

 Summary: For Queue AccessControlException client retries multiple 
times on both RM
 Key: YARN-4833
 URL: https://issues.apache.org/jira/browse/YARN-4833
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Submit an application to a queue where ACLs are enabled and the submitting 
user does not have access. The client retries until failMaxattempt (10 times).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4829) Add support for binary units

2016-03-18 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198012#comment-15198012
 ] 

Arun Suresh commented on YARN-4829:
---

Thanks, LGTM
+1 pending Jenkins

> Add support for binary units
> 
>
> Key: YARN-4829
> URL: https://issues.apache.org/jira/browse/YARN-4829
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4829-YARN-3926.001.patch, 
> YARN-4829-YARN-3926.002.patch, YARN-4829-YARN-3926.003.patch
>
>
> The units conversion util should have support for binary units.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4820) ResourceManager web redirects in HA mode drops query parameters

2016-03-18 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199524#comment-15199524
 ] 

Varun Vasudev commented on YARN-4820:
-

bq. I was just looking at the redirect logic and noting it was looking at 302's 
only.

[~steve_l] - can you point me to where you found this logic? The logic in 
RMWebAppFilter.java is that if the node is a standby, it redirects all requests 
to the active.



> ResourceManager web redirects in HA mode drops query parameters
> ---
>
> Key: YARN-4820
> URL: https://issues.apache.org/jira/browse/YARN-4820
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4820.001.patch, YARN-4820.002.patch
>
>
> The RMWebAppFilter redirects http requests from the standby to the active. 
> However it drops all the query parameters when it does the redirect.
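
As a hedged illustration of the fix direction (not the actual RMWebAppFilter 
change), the redirect target can be rebuilt against the active RM while 
carrying the original path and query string forward; the host names below are 
placeholders.

{code}
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: rebuild the redirect URL against the active RM while keeping the
// original path and query string instead of dropping them.
public class RedirectWithQuery {
  static String redirectTarget(String activeRmWebApp, URI originalRequest)
      throws URISyntaxException {
    URI active = new URI(activeRmWebApp);
    return new URI(active.getScheme(), active.getAuthority(),
        originalRequest.getPath(),       // keep the requested path
        originalRequest.getQuery(),      // keep ?states=...&user=... etc.
        null).toString();
  }

  public static void main(String[] args) throws URISyntaxException {
    URI req = URI.create(
        "http://standby-rm:8088/ws/v1/cluster/apps?states=RUNNING&user=alice");
    System.out.println(redirectTarget("http://active-rm:8088", req));
    // -> http://active-rm:8088/ws/v1/cluster/apps?states=RUNNING&user=alice
  }
}
{code}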



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4607) AppAttempt page TotalOutstandingResource Requests table support pagination

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202269#comment-15202269
 ] 

Hadoop QA commented on YARN-4607:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 52s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
53s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 15s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
25s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 57s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
26s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
47s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 8s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
23s {color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: 
patch generated 0 new + 21 unchanged - 2 fixed = 21 total (was 23) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s 
{color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 72m 53s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK 
v1.7.0_95. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 53s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:gr

[jira] [Commented] (YARN-4815) ATS 1.5 timelineclient impl try to create attempt directory for every event call

2016-03-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202431#comment-15202431
 ] 

Junping Du commented on YARN-4815:
--

v2 patch LGTM. However, can we fix checkstyle issues?

> ATS 1.5 timelineclient impl try to create attempt directory for every event 
> call
> 
>
> Key: YARN-4815
> URL: https://issues.apache.org/jira/browse/YARN-4815
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4815.1.patch, YARN-4815.2.patch
>
>
> The ATS 1.5 timelineclient impl tries to create the attempt directory for 
> every event call. Since one directory-creation call per attempt is enough, 
> this causes a performance issue.
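
To illustrate the intent of the fix (a hedged sketch with hypothetical names, 
not the actual timeline client code), the directory creation can be memoized 
per attempt so repeated event calls reuse the already-created directory.

{code}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: remember which attempt directories have already been created so
// that per-event writes do not repeat the mkdir call.
public class AttemptDirCache {
  private final Map<String, Path> created = new ConcurrentHashMap<>();
  private final Path baseDir;

  AttemptDirCache(Path baseDir) {
    this.baseDir = baseDir;
  }

  Path dirFor(String attemptId) {
    return created.computeIfAbsent(attemptId, id -> {
      try {
        return Files.createDirectories(baseDir.resolve(id));  // first event only
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  public static void main(String[] args) {
    AttemptDirCache cache = new AttemptDirCache(Paths.get("/tmp/ats-sketch"));
    // Many events for the same attempt -> a single createDirectories call.
    for (int i = 0; i < 3; i++) {
      System.out.println(cache.dirFor("appattempt_0001_000001"));
    }
  }
}
{code}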



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4746) yarn web services should convert parse failures of appId to 400

2016-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202503#comment-15202503
 ] 

Hadoop QA commented on YARN-4746:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 41s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
59s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 53s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 16s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 1s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
57s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
51s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 27s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 44s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
41s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 47s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
47s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 18s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 55s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s 
{color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 10s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 48s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 17s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s 
{color} | {color:gree