[jira] [Updated] (YARN-4435) Add RM Delegation Token DtFetcher Implementation for DtUtil
[ https://issues.apache.org/jira/browse/YARN-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Paduano updated YARN-4435: -- Attachment: YARN-4435.00.patch.txt Attaching code for the current version of these from my branch. They won't compile until HADOOP-12563 is committed. > Add RM Delegation Token DtFetcher Implementation for DtUtil > --- > > Key: YARN-4435 > URL: https://issues.apache.org/jira/browse/YARN-4435 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Matthew Paduano >Assignee: Matthew Paduano > Attachments: YARN-4435.00.patch.txt, proposed_solution > > > Add a class to yarn project that implements the DtFetcher interface to return > a RM delegation token object. > I attached a proposed class implementation that does this, but it cannot be > added as a patch until the interface is merged in HADOOP-12563 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231657#comment-15231657 ] Akira AJISAKA commented on YARN-4630: - Hi [~sarutak], thank you for updating the patch.
{code}
public int compareTo(ContainerId other) {
  int result = this.getApplicationAttemptId().compareTo(
      other.getApplicationAttemptId()) == 0;
{code}
Would you remove {{== 0}} to fix the compilation error? > Remove useless boxing/unboxing code (Hadoop YARN) > - > > Key: YARN-4630 > URL: https://issues.apache.org/jira/browse/YARN-4630 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Attachments: YARN-4630.0.patch, YARN-4630.1.patch > > > There are lots of places where useless boxing/unboxing occur. > To avoid performance issue, let's remove them.
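The stray {{== 0}} makes the right-hand side a boolean, which cannot be assigned to the int {{result}}. A minimal sketch of the corrected comparison; the class and its fields are illustrative stand-ins, not the real YARN ContainerId:

```java
// Illustrative stand-in for ContainerId's comparison: compare the attempt
// first, then tie-break on the container id. No "== 0", and Long.compare
// avoids the boxing this JIRA is about.
public class ContainerIdSketch implements Comparable<ContainerIdSketch> {
  private final long attemptId;   // stand-in for getApplicationAttemptId()
  private final long containerId; // stand-in for getContainerId()

  public ContainerIdSketch(long attemptId, long containerId) {
    this.attemptId = attemptId;
    this.containerId = containerId;
  }

  @Override
  public int compareTo(ContainerIdSketch other) {
    int result = Long.compare(this.attemptId, other.attemptId);
    if (result == 0) {
      // Same attempt: order by container id.
      result = Long.compare(this.containerId, other.containerId);
    }
    return result;
  }
}
```

Using {{Long.compare}} rather than {{Long.valueOf(...).compareTo(...)}} is also in the spirit of the boxing/unboxing cleanup.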
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231648#comment-15231648 ] Li Lu commented on YARN-3816: - Thanks [~sjlee0]! Yes, I did use the words "accumulation" and "aggregation" interchangeably, and I can certainly correct this in the follow-up patch. However, I think you may have overlooked one key change in the latest (v5) patch (due to the word "accumulation"). In this patch, my main focus is to implement aggregation (aggregating container metrics to application level), even though the API for TimelineMetric is called "accumulate". Aggregating metrics from all containers to one application is performed in the timeline collector, using the internal Map called aggregationGroups. In this map, we maintain the aggregation status for each "group" (right now I used entity_type since all CONTAINER type entities will be mapped together). Within one aggregation group, we maintain metric status for each entity_id (each container id). On aggregation, for each aggregation group (like the CONTAINER entity type), for each existing metric (like HDFS_BYTES_WRITE), we iterate through all known entity ids (containers) and perform the aggregation operation defined in the metric's realtimeAggregationOp field. Contrary to your comment, accumulation is actually the part missing in this draft patch. When we update the state for one container on one metric, we simply replace the previous one (in AggregationStatus#update, {{aggrRow.getPerEntityMetrics().put(entityId, m);}}). We can add methods to perform time-based accumulation later (reusing the "accumulate" method's name). BTW, by default a metric's aggregation op field is set to NOP so that we're not keeping them in the aggregation status table. Given the tight timeframe, we can certainly sync up offline if needed. Thanks! 
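A rough sketch of the rollup described above (per aggregation group, per metric, keep the latest value per entity id and fold them all on aggregation). The class and method names below are illustrative stand-ins, not the actual patch code, and summation stands in for the metric's realtimeAggregationOp:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative container-to-application rollup: for each aggregation group
// (entity type) keep the latest value of each metric per entity id, then
// fold across all entity ids.
public class AggregationSketch {
  // group -> metric name -> entity id -> latest value
  private final Map<String, Map<String, Map<String, Long>>> groups =
      new HashMap<>();

  // Replace the previous value for this entity (no time-based accumulation,
  // matching the behavior described in the comment above).
  public void update(String group, String metric, String entityId, long value) {
    groups.computeIfAbsent(group, g -> new HashMap<>())
          .computeIfAbsent(metric, m -> new HashMap<>())
          .put(entityId, value);
  }

  // Aggregate one metric across all entities in a group by summing
  // (a stand-in for applying the metric's realtimeAggregationOp).
  public long aggregateSum(String group, String metric) {
    long total = 0;
    for (long v : groups.getOrDefault(group, Collections.emptyMap())
                        .getOrDefault(metric, Collections.emptyMap())
                        .values()) {
      total += v;
    }
    return total;
  }
}
```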
> [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Li Lu > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-YARN-2928-v5.patch, YARN-3816-feature-YARN-2928.v4.1.patch, > YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231626#comment-15231626 ] Sangjin Lee commented on YARN-3816: --- Onto code-level comments... First, there seem to be checkstyle violations and javadoc errors. Could you please fix them? (RealTimeAggregationOperation.java) - As mentioned in the above comment, this really appears to be about "accumulation". We should rename things here to "accumulation". - l.36: We don’t need to update {{state}} for MAX? Could you explain how {{state}} is supposed to be used? - I don’t think I understand {{SUM.exec()}}. Maybe some comment in the code (or a JIRA comment) could be helpful. - l.116: There is no need for a separate interface ({{Operation}}). The {{exec()}} method can simply belong in {{RealTimeAggregationOperation}} itself. (TimelineMetric.java) - l.105: This is an unrelated issue with this patch, but I’m not sure what’s going on with the else clause in l.104-106 in the {{setValues()}} method. Could you look at it and fix it if it is not right? - l.183: we should use {{StringBuilder}} (unsynchronized) over {{StringBuffer}} (synchronized) - l.191: I would say use “get” instead of “retrieve” for these method names... - l.192: nit: since this is an enum, {{==}} is sufficient (no need for {{equals()}}); the same for l.206 and 220 - l.196: It should be {{firstKey()}} because it’s reverse sorted, right? We’re looking for the latest timestamp. - l.205: the name “key” is a bit obscure. What we mean is the timestamp for the value. Should we rename this to {{getSingleDataTimestamp()}}? (TimelineMetricCalculator.java) - l.38: typo: “Number be compared” -> “Number to be compared”. The same with l.71 - l.41: nit: need a space before the opening brace - l.76: same as above - l.68: We stated that we will support only longs as the metric value type for now (and maybe double later). In any case, I think it’s safe to say we need not support ints. 
Should we simplify this by casting ints to longs if we see them? - l.109: do we need to check for both being null? - l.145: I think we should check to ensure time > 0. Also, it might be easier if we specify time as {{long}} instead of {{Long}}. - l.151: wouldn’t it be easier if we called {{sum()}} to handle the summation part instead of implementing the summing logic here again? - l.194: nit: space before the brace (TimelineCollector.java) - l.59-69: nit: let’s group all statics at the beginning and place instance members after them - the executor should be shut down properly in {{serviceStop()}}, or it will leave those threads hanging around - l.129: nit: we don’t need to specify {{TimelineCollector}} in calling the static methods (in several places here) - l.218: nit: let’s surround it with {{LOG.isDebugEnabled()}} - l.237-241: This is a bit of an anti-pattern for using a {{ConcurrentHashMap}}. The issue is if multiple threads find that {{aggrRow}} is null and try to put their copies to the {{aggregateTable}} map, there is a race. As a result, you may start operating on an instance that will not be stored in the map eventually. You should use the {{putIfAbsent()}} method to make sure multiple threads always agree on the stored instance after the operation. - l.247: nit: let’s use == - l.258: nit: let’s use == (TimelineReaderWebServices.java) - Are the imports needed? There are no other code changes in this file? 
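The race flagged at l.237-241 and the {{putIfAbsent()}} remedy can be sketched as follows; {{AggrRow}} is a hypothetical stand-in for the row type in the patch, not the real class:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PutIfAbsentSketch {
  // Hypothetical stand-in for the per-group aggregation row.
  static class AggrRow {
    final String group;
    AggrRow(String group) { this.group = group; }
  }

  private final ConcurrentMap<String, AggrRow> aggregateTable =
      new ConcurrentHashMap<>();

  // Anti-pattern (racy): two threads can both observe null, both put, and
  // one thread then keeps operating on a row that was overwritten:
  //
  //   AggrRow row = aggregateTable.get(key);
  //   if (row == null) {
  //     row = new AggrRow(key);
  //     aggregateTable.put(key, row);  // may clobber another thread's row
  //   }

  // Correct: putIfAbsent guarantees all threads converge on the single
  // instance that actually ended up in the map.
  public AggrRow getOrCreate(String key) {
    AggrRow row = aggregateTable.get(key);
    if (row == null) {
      AggrRow candidate = new AggrRow(key);
      row = aggregateTable.putIfAbsent(key, candidate);
      if (row == null) {
        row = candidate; // our candidate won the race
      }
    }
    return row;
  }
}
```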
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231612#comment-15231612 ] Sangjin Lee commented on YARN-3816: --- [~gtCarrera9], thanks much for posting an updated patch for this! I just had an opportunity to go over it fairly completely once, and have some high-level comments as well as more detailed code feedback. Starting with high-level comments: 1. "aggregation" v. "accumulation" This came up several times on this JIRA, and I think the distinction is crucial in getting this completed. I believe what we agreed on is as follows: "aggregation" is about rolling up metrics from a child type to a parent type (e.g. rolling up metrics from containers to applications), and "accumulation" is about computing/deriving secondary values based on the *time dimension* (e.g. area under the curve or the running maximum). Those two are rather independent, and we should not mix them. Unfortunately in the latest patch, these two terms are used very much interchangeably. Can we make that distinction clear and rename all the classes/methods/variables that pertain to accumulation from "aggregation" to "accumulation"? It would be good if we reserve "aggregation" to child-to-parent rollups. 2. container-to-application aggregation Related to above, this JIRA was meant to implement 2 features: (1) "aggregating" metrics from containers to applications, and (2) "accumulating" metrics for (certain) entity types. Both should be done. However, in the latest patch, *I do not see (1) being done*. In other words, I didn't find code that rolls up metrics from the container entities and sets them to the parent application entities. Am I missing something? The previous patches did implement that. Without this, we will *NOT* see things like container CPU or memory being rolled up to applications, and as a consequence to flow runs, and so on. This is a MUST. 
IMO that is a separate functionality from the accumulation. I think we should do it clearly and explicitly. And the rolled-up metrics should be set onto the application entities. 3. time-based accumulation We also said that the time-based accumulation should be conditional on a configuration (see [the previous patch|https://issues.apache.org/jira/secure/attachment/12761120/YARN-3816-YARN-2928-v4.patch]). I see that condition is not there in the latest patch. Can we please make the accumulation conditional on that configuration? Also, this was an issue with the previous patches and I think it exists with the latest patch. It appears that we are doing the time-based accumulation for *all metrics for all entity types*. We might want to think about whether that would be OK. There are some performance and storage implications in doing so. Also, I raised some semantic issues with that idea. See the previous comment [here|https://issues.apache.org/jira/browse/YARN-3816?focusedCommentId=15067321&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15067321]. I'm not 100% certain if the latest patch has the same issue or not although I suspect it might. 4. new YARN_APPLICATION_AGGREGATION entity type I also raised a concern whether we should use a separate entity type for this. First of all, the "aggregation" (from containers to applications) *should* go to the actual application type. Second, even for "accumulation" you might want to think about what you want to do. I assume that the accumulated metrics (YARN_APPLICATION_AGGREGATION) are being written to the entities table. Note that they are not really considered as part of the application, and are not available for application queries. So there is an implication for queries. And they are not going to be aggregated up to the flow runs. I know this is a lot to parse, and obviously there is much history in this discussion. 
However, it would help to replay the main discussions up to this point so that we don't lose these important points. Thanks much!
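To make the distinction above concrete: aggregation folds values across entities at one point in time, while accumulation derives secondary values along the time dimension of a single metric (e.g. the running maximum, or the area under the curve). A hypothetical sketch of the latter, not taken from any of the attached patches:

```java
// Illustrative time-based accumulation for one metric: running maximum and
// "area under the curve" (each value multiplied by how long it was held).
// All names are hypothetical.
public class AccumulationSketch {
  private long runningMax = Long.MIN_VALUE;
  private long area = 0;   // sum of value * duration held
  private long lastTs = -1;
  private long lastValue = 0;

  public void record(long timestamp, long value) {
    if (lastTs >= 0) {
      // The previous value was in effect from lastTs until now.
      area += lastValue * (timestamp - lastTs);
    }
    runningMax = Math.max(runningMax, value);
    lastTs = timestamp;
    lastValue = value;
  }

  public long getRunningMax() { return runningMax; }
  public long getArea() { return area; }
}
```

Aggregation, by contrast, would fold such per-entity metrics upward (container -> application), independent of the time axis.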
[jira] [Updated] (YARN-4865) Track Reserved resources in ResourceUsage and QueueCapacities
[ https://issues.apache.org/jira/browse/YARN-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4865: -- Attachment: 0003-YARN-4865-addendum.patch Thank you [~leftnoteasy] and [~karams]. Attached a new patch. The test case covers the below scenario: - One container is reserved for app2 on node1. - Killed a running container of app1, thus making enough space on node1 for the app2 container. - The reserved container became allocated. Verified the new metrics against the same. Please suggest if this is fine or not. I will raise another ticket to handle cases like node removal etc. > Track Reserved resources in ResourceUsage and QueueCapacities > -- > > Key: YARN-4865 > URL: https://issues.apache.org/jira/browse/YARN-4865 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Sunil G >Assignee: Sunil G > Fix For: 2.9.0 > > Attachments: 0001-YARN-4865.patch, 0002-YARN-4865.patch, > 0003-YARN-4865-addendum.patch, 0003-YARN-4865.patch > > > As discussed in YARN-4678, capture reserved capacity separately in > QueueCapcities for better tracking.
[jira] [Commented] (YARN-4425) Pluggable sharing policy for Partition Node Label resources
[ https://issues.apache.org/jira/browse/YARN-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231438#comment-15231438 ] Hadoop QA commented on YARN-4425: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s {color} | {color:red} YARN-4425 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12780557/YARN-4425.20160105-1.patch | | JIRA Issue | YARN-4425 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10989/console | | Powered by | Apache Yetus 0.2.0 http://yetus.apache.org | This message was automatically generated. > Pluggable sharing policy for Partition Node Label resources > --- > > Key: YARN-4425 > URL: https://issues.apache.org/jira/browse/YARN-4425 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: ResourceSharingPolicyForNodeLabelsPartitions-V1.pdf, > ResourceSharingPolicyForNodeLabelsPartitions-V2.pdf, > YARN-4425.20160105-1.patch > > > As part of support for sharing NonExclusive Node Label partitions in > YARN-3214, NonExclusive partitions are shared only to Default Partitions and > also have fixed rule when apps in default partitions makes use of resources > of any NonExclusive partitions. > There are many scenarios where in we require pluggable policy like > MutliTenant, Hierarchical etc.. 
where in each partition can determine when > they want to share the resources to other paritions and when other partitions > wants to use resources from others > More details in the attached document. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4425) Pluggable sharing policy for Partition Node Label resources
[ https://issues.apache.org/jira/browse/YARN-4425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231431#comment-15231431 ] Wangda Tan commented on YARN-4425: -- [~Naganarasimha], [~xinxianyin], Read the doc and took a very high level look at the patch, sorry for the huge delays. Some thoughts: The newly added policy seems like a backdoor to the scheduler's capacity management: under "NON_EXCLUSIVE" mode, the scheduler completely depends on the configured policy to decide who gets the next resources. And we need to give the policy API clearer semantics: in the existing scheduler, resources are allocated on the requested partition, and only request.partition = "" gets a chance to allocate on other partitions when the non-exclusive criteria are met. In the new API, resources could be allocated on any partition regardless of the requested partition (depending on the policy implementation). This conflicts with our existing APIs for node partitions. To me, sharing resources between partitions is itself not a clear API: you can say partition A has total resource = 100G and partition B has total resource = 200G. But you cannot say "under some conditions, partition A can use idle resources from partition B" -- because a partition is not the entity that consumes resources. 
Instead, it would be clearer to me to say: 1) a queue's shares of partitions could be dynamically adjusted, OR 2) a node's partition could be dynamically updated.
[jira] [Commented] (YARN-4928) Some yarn.server.timeline.* tests fail on Windows attempting to use a test root path containing a colon
[ https://issues.apache.org/jira/browse/YARN-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231430#comment-15231430 ] Li Lu commented on YARN-4928: - My only concern is the import line raised by [~djp]. The findbugs warning is unrelated to the fix here. > Some yarn.server.timeline.* tests fail on Windows attempting to use a test > root path containing a colon > --- > > Key: YARN-4928 > URL: https://issues.apache.org/jira/browse/YARN-4928 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 > Environment: OS: Windows Server 2012 > JDK: 1.7.0_79 >Reporter: Gergely Novák >Assignee: Gergely Novák >Priority: Minor > Attachments: YARN-4928.001.patch, YARN-4928.002.patch, > YARN-4928.003.patch > > > yarn.server.timeline.TestEntityGroupFSTimelineStore.* and > yarn.server.timeline.TestLogInfo.* fail on Windows, because they are > attempting to use a test root paths like > "/C:/hdp/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/target/test-dir/TestLogInfo", > which contains a ":" (after the Windows drive letter) and > DFSUtil.isValidName() does not accept paths containing ":". > This problem is identical to HDFS-6189, so I suggest to use the same > approach: using "/tmp/..." as test root dir instead of > System.getProperty("test.build.data", System.getProperty("java.io.tmpdir")). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
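The fix the description suggests (mirroring HDFS-6189) amounts to building the test root from a colon-free literal instead of the build directory. A simplified sketch, with the validity check reduced to just the colon test that trips up Windows drive-letter paths; {{isValidName}} here is a stand-in, not the real DFSUtil method:

```java
public class TestRootSketch {
  // Roots derived from test.build.data look like "/C:/hdp/.../test-dir/..."
  // on Windows; the drive-letter colon makes them invalid HDFS names.
  // A fixed "/tmp/..." root sidesteps the problem.
  public static String testRoot(String testName) {
    return "/tmp/" + testName;
  }

  // Simplified stand-in for DFSUtil.isValidName: HDFS path components may
  // not contain ":".
  public static boolean isValidName(String path) {
    return !path.contains(":");
  }
}
```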
[jira] [Commented] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231422#comment-15231422 ] Hudson commented on YARN-4756: -- FAILURE: Integrated in Hadoop-trunk-Commit #9576 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9576/]) YARN-4756. Unnecessary wait in Node Status Updater during reboot. (Eric (kasha: rev e82f961a3925aadf9e53a009820a48ba9e4f78b6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerResync.java > Unnecessary wait in Node Status Updater during reboot > - > > Key: YARN-4756 > URL: https://issues.apache.org/jira/browse/YARN-4756 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger > Attachments: YARN-4756.001.patch, YARN-4756.002.patch, > YARN-4756.003.patch, YARN-4756.004.patch, YARN-4756.005.patch > > > The startStatusUpdater thread waits for the isStopped variable to be set to > true, but it is waiting for the next heartbeat. During a reboot, the next > heartbeat will not come and so the thread waits for a timeout. Instead, we > should notify the thread to continue so that it can check the isStopped > variable and exit without having to wait for a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
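The pattern described in the issue (notify the waiting heartbeat loop so it re-checks {{isStopped}} instead of waiting out the heartbeat interval) is the standard guarded wait/notify idiom; the names below are illustrative, not the actual NodeStatusUpdaterImpl code:

```java
// Sketch of the shutdown pattern: the status-updater loop waits on a
// monitor between heartbeats; stop() flips the flag and notifies so the
// thread exits immediately rather than timing out.
public class StatusUpdaterSketch {
  private final Object monitor = new Object();
  private volatile boolean stopped = false;

  // Returns the number of heartbeats "sent" before stopping.
  public int runLoop(long heartbeatIntervalMs) {
    int heartbeats = 0;
    synchronized (monitor) {
      while (!stopped) {
        heartbeats++; // stand-in for sending a heartbeat
        try {
          // Wait for the next interval OR an early wake-up from stop().
          monitor.wait(heartbeatIntervalMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    return heartbeats;
  }

  public void stop() {
    synchronized (monitor) {
      stopped = true;
      monitor.notifyAll(); // wake the loop so it re-checks stopped
    }
  }
}
```

Without the {{notifyAll()}}, a reboot would leave the loop parked for up to a full heartbeat interval, which is exactly the unnecessary wait the JIRA removes.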
[jira] [Commented] (YARN-4902) [Umbrella] Generalized and unified scheduling-strategies in YARN
[ https://issues.apache.org/jira/browse/YARN-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231401#comment-15231401 ] Wangda Tan commented on YARN-4902: -- Thanks for the reviews, [~asuresh]/[~subru]: Feedback on some of your points: bq. rename the allocation-id proposed here to maybe resource-request-id? Since the id will be a part of the allocated resource (no matter whether it's an allocation or a container), is it better to rename it to "allocation-request-id"? (cc: [~vinodkv]) bq. Now that we are reworking the API from scratch, can we add a cost function for the ResourceRequest? I feel Priority is being overloaded to express scheduling cost, preemption cost, container types etc. Could you elaborate "scheduling cost"? "Preemption cost" should depend on the running process in the context of a reusable allocation, correct? Since a user can use the same slot to run important and less-important workloads. "Container type" should be a part of tag per my understanding. bq. grok why we need both maximum number of allocations & maximum concurrency Maximum concurrency is to avoid one app taking over the whole cluster. It only limits the total concurrent resources used by one app. bq. The current Schedulers will be extremely hard pressed to efficiently handle GUTS API requests. I guess this should act as a good motivation to consider an application centric approach as opposed to the current node centric one. Agree, global scheduling becomes important if we want to support such an API. > [Umbrella] Generalized and unified scheduling-strategies in YARN > > > Key: YARN-4902 > URL: https://issues.apache.org/jira/browse/YARN-4902 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Wangda Tan > Attachments: Generalized and unified scheduling-strategies in YARN > -v0.pdf > > > Apache Hadoop YARN's ResourceRequest mechanism is the core part of the YARN's > scheduling API for applications to use. 
The ResourceRequest mechanism is a > powerful API for applications (specifically ApplicationMasters) to indicate > to YARN what size of containers are needed, and where in the cluster etc. > However a host of new feature requirements are making the API increasingly > more and more complex and difficult to understand by users and making it very > complicated to implement within the code-base. > This JIRA aims to generalize and unify all such scheduling-strategies in YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231386#comment-15231386 ] Junping Du commented on YARN-4771: -- 002 patch LGTM. An additional fix: we'd better use MonotonicTime in place of System.currentTimeMillis() for tracking the timeout - just an optional comment; we can address it here or in a separate JIRA. > Some containers can be skipped during log aggregation after NM restart > -- > > Key: YARN-4771 > URL: https://issues.apache.org/jira/browse/YARN-4771 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.2 >Reporter: Jason Lowe >Priority: Critical > Attachments: YARN-4771.001.patch, YARN-4771.002.patch > > > A container can be skipped during log aggregation after a work-preserving > nodemanager restart if the following events occur: > # Container completes more than > yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the > restart > # At least one other container completes after the above container and before > the restart
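The monotonic-time suggestion can be sketched with {{System.nanoTime()}}, which is unaffected by wall-clock adjustments, unlike {{System.currentTimeMillis()}}; the class below is an illustration, not Hadoop code:

```java
// Timeout tracking on a monotonic clock: wall-clock jumps (NTP sync,
// manual adjustment) cannot shorten or stretch the timeout.
public class TimeoutTracker {
  private final long deadlineNanos;

  public TimeoutTracker(long timeoutMs) {
    this.deadlineNanos = System.nanoTime() + timeoutMs * 1_000_000L;
  }

  public boolean expired() {
    // Compare via subtraction so the check stays correct even if
    // nanoTime() wraps around.
    return System.nanoTime() - deadlineNanos >= 0;
  }
}
```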
[jira] [Commented] (YARN-4756) Unnecessary wait in Node Status Updater during reboot
[ https://issues.apache.org/jira/browse/YARN-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231376#comment-15231376 ] Karthik Kambatla commented on YARN-4756: The patch seems reasonable to me. +1. Also, quite excited to see a +1 from Hadoop QA. Checking this in. > Unnecessary wait in Node Status Updater during reboot > - > > Key: YARN-4756 > URL: https://issues.apache.org/jira/browse/YARN-4756 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger > Attachments: YARN-4756.001.patch, YARN-4756.002.patch, > YARN-4756.003.patch, YARN-4756.004.patch, YARN-4756.005.patch > > > The startStatusUpdater thread waits for the isStopped variable to be set to > true, but it is waiting for the next heartbeat. During a reboot, the next > heartbeat will not come and so the thread waits for a timeout. Instead, we > should notify the thread to continue so that it can check the isStopped > variable and exit without having to wait for a timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4902) [Umbrella] Generalized and unified scheduling-strategies in YARN
[ https://issues.apache.org/jira/browse/YARN-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231365#comment-15231365 ] Subru Krishnan commented on YARN-4902: -- Thanks [~vinodkv], [~leftnoteasy], [~jianhe], [~vvasudev] and others for putting up this proposal. I went through it & it seems quite relevant with the increasing range of workloads we have to support in the near future in YARN. I have a few high level comments below. Obviously this needs a lot more thought/discussion. *GUTS API feedback*: - I want to echo [~asuresh]'s comment on consolidating _Allocation-ID_ with _Request-ID_ proposed in YARN-4879, and [~vinodkv] seems to agree based on his [comments|https://issues.apache.org/jira/browse/YARN-4879?focusedCommentId=15220475]. - Now that we are reworking the API from scratch, can we add a *cost function* for the _ResourceRequest_? I feel _Priority_ is being overloaded to express scheduling cost, preemption cost, container types etc. - I am not able to grok why we need both _maximum number of allocations & maximum concurrency_, especially considering that this is on top of the existing _numContainers_. Won't they conflict? - Can we have a section at the end to explicitly list the mandatory and optional attributes at the _Application_ and _ResourceRequests_ level? The document is rather long and so a snapshot summary will be good. - Overall the proposed API seems quite powerful but we should make sure that we don't end up trading simplicity for functionality IMHO (this is based on the feedback we received for YARN-1051). For instance, the typical MapReduce scenario feels more dense when compared to the current APIs but should be more easily expressible if we sacrifice on additional flexibility that the GUTS API provides. So it'll also be good to have examples of how current constrained asks will look when made through the GUTS API. 
*Time aspects*: - I agree that we should consolidate the time-related placement conditions with the work done in YARN-1051. - + capital 1 on your observation that _The reservations feature proposed at YARN-1051 can pave a great way for implementing minimumconcurrency_ :). *Scheduler enhancements*: - The current _Schedulers_ will be extremely hard pressed to efficiently handle GUTS API requests. I guess this should act as a good motivation to consider an _application centric_ approach as opposed to the current _node centric_ one, as we have occasionally discussed with [~asuresh], [~kasha], [~curino] et al.
[jira] [Comment Edited] (YARN-4865) Track Reserved resources in ResourceUsage and QueueCapacities
[ https://issues.apache.org/jira/browse/YARN-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231126#comment-15231126 ] Wangda Tan edited comment on YARN-4865 at 4/7/16 11:56 PM: --- [~sunilg], It seems this patch needs one more fix: {code} diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java index 9a74c22..df57787 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java @@ -1322,14 +1322,6 @@ public void completedContainer(Resource clusterResource, // Book-keeping if (removed) { - - // track reserved resource for metrics, for normal container - // getReservedResource will be null. 
- Resource reservedRes = rmContainer.getReservedResource(); - if (reservedRes != null && !reservedRes.equals(Resources.none())) { -decReservedResource(node.getPartition(), reservedRes); - } - // Inform the ordering policy orderingPolicy.containerReleased(application, rmContainer); diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java index cf1b3e0..558fc53 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java @@ -247,6 +247,8 @@ public synchronized boolean unreserve(Priority priority, // Update reserved metrics queue.getMetrics().unreserveResource(getUser(), rmContainer.getReservedResource()); + + queue.decReservedResource(node.getPartition(), rmContainer.getReservedResource()); return true; } return false; {code} We need the above change to make sure that an allocation from a reserved container correctly deducts the reserved resource. [~sunilg], could you also add a few tests? Some other cases come to mind that we need to consider: - Nodes lost/disconnected: we need to deduct reserved resources on such nodes. (I think this should be covered by the completedContainer code path.) The above can be addressed in a separate JIRA. 
(Thanks [~karams] for reporting this issue.)
[jira] [Commented] (YARN-4865) Track Reserved resources in ResourceUsage and QueueCapacities
[ https://issues.apache.org/jira/browse/YARN-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231361#comment-15231361 ] Sunil G commented on YARN-4865: --- Thanks [~leftnoteasy]. I will add some more tests with this suggested change. > Track Reserved resources in ResourceUsage and QueueCapacities > -- > > Key: YARN-4865 > URL: https://issues.apache.org/jira/browse/YARN-4865 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Sunil G >Assignee: Sunil G > Fix For: 2.9.0 > > Attachments: 0001-YARN-4865.patch, 0002-YARN-4865.patch, > 0003-YARN-4865.patch > > > As discussed in YARN-4678, capture reserved capacity separately in > QueueCapcities for better tracking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
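The bookkeeping idea behind the patch above — deduct the reserved resource at unreserve time so every path (reservation fulfilled or cancelled) goes through one place — can be sketched with a toy tracker. This is not the actual LeafQueue/FiCaSchedulerApp code; the class and its long-based accounting are illustrative stand-ins for YARN's Resource objects.

```java
// Toy stand-in for queue-level reserved-resource accounting.
// Real YARN tracks full Resource objects per partition; this sketch uses
// a single memory counter to show the invariant: every reserve() must be
// balanced by exactly one unreserve(), regardless of why the reservation ends.
public class ReservedResourceTracker {
    private long reservedMB;

    public void reserve(long mb) {
        reservedMB += mb;
    }

    // Deducting here (rather than in the container-completed path) keeps
    // the "reservation turned into an allocation" and "reservation
    // cancelled" cases from double-counting or missing the deduction.
    public boolean unreserve(long mb) {
        if (mb > reservedMB) {
            return false; // nothing (or not enough) reserved; reject
        }
        reservedMB -= mb;
        return true;
    }

    public long getReservedMB() { return reservedMB; }
}
```

The node-lost case mentioned above would map to calling unreserve for any reservation sitting on the departed node.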
[jira] [Commented] (YARN-4927) TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the default
[ https://issues.apache.org/jira/browse/YARN-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231351#comment-15231351 ] Karthik Kambatla commented on YARN-4927: Thanks for picking this up, [~bibinchundatt]. A few comments: # AdminService: since the test is in the same class, {{refreshAll}} could be package-private instead of public. Also, you might want to mark it @VisibleForTesting along with a comment that it could otherwise be private. # TestRMHA ## The new variable counter could be private to the anonymous AdminService class we are creating in the test. ## The assertion when the RM fails to transition to active seems backwards. Shouldn't we be checking {{e.getMessage().contains("")}}? ## I wonder if we are even running into that exception. If the test is expecting the exception, we should add an {{Assert.fail}} right after the call to transition to active. ## Also, I am not a fan of checking just the message verbatim. Can we check if the exception is {{ServiceFailedException}} and preferably the expected RM state (Active/Standby)? ## Not introduced in this patch, but the asserts in the test should have a corresponding error message to explain what exactly is going on. > TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the > default > > > Key: YARN-4927 > URL: https://issues.apache.org/jira/browse/YARN-4927 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4927.patch > > > YARN-3893 adds this test, which relies on some CapacityScheduler-specific > stuff for refreshAll to fail, which doesn't apply when using FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
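The expected-exception pattern the review asks for (fail if no exception, then check the exception's type rather than just its message) can be sketched as below. The ServiceFailedException here is a local stand-in class, not the real Hadoop one, and transitionToActive() is a dummy that always throws; in the actual test the Assert.fail/catch structure would wrap the RM's transition call.

```java
// Sketch of the test structure discussed above. All names are local
// stand-ins; in JUnit the boolean returns would be Assert.fail(...) and
// assertTrue(...) calls instead.
public class ExpectedExceptionPattern {
    static class ServiceFailedException extends RuntimeException {
        ServiceFailedException(String msg) { super(msg); }
    }

    // Dummy transition that always fails, standing in for the RM call
    // that the test expects to throw.
    static void transitionToActive() {
        throw new ServiceFailedException("refreshAll failed");
    }

    public static boolean runCheck() {
        try {
            transitionToActive();
            return false;  // equivalent of Assert.fail("expected exception")
        } catch (ServiceFailedException e) {
            // The catch clause itself verifies the exception type;
            // the message check is secondary, per the review comment.
            return e.getMessage().contains("refreshAll");
        }
    }
}
```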
[jira] [Updated] (YARN-4931) Preempted resources go back to the same application
[ https://issues.apache.org/jira/browse/YARN-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Crawford updated YARN-4931: - Description: Sometimes a queue that needs resources causes preemption - but the preempted containers are just allocated right back to the application that just released them! Here is a tiny application (0007) that wants resources, and a container is preempted from application 0002 to satisfy it: {code} 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler (FairSchedulerUpdateThread): Should preempt res for queue root.default: resDueToMinShare = , resDueToFairShare = 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler (FairSchedulerUpdateThread): Preempting container (prio=1res=) from queue root.milesc 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics (FairSchedulerUpdateThread): Non-AM container preempted, current appAttemptId=appattempt_1460047303577_0002_01, containerId=container_1460047303577_0002_01_001038, resource= 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (FairSchedulerUpdateThread): container_1460047303577_0002_01_001038 Container Transitioned from RUNNING to KILLED {code} But then a moment later, application 2 gets the container right back: {code} 2016-04-07 21:08:13,844 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode (ResourceManager Event Processor): Assigned container container_1460047303577_0002_01_001039 of capacity on host ip-10-12-40-63.us-west-2.compute.internal:8041, which has 13 containers, used and available after allocation 2016-04-07 21:08:14,555 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (IPC Server handler 59 on 8030): container_1460047303577_0002_01_001039 Container Transitioned from ALLOCATED to ACQUIRED 
2016-04-07 21:08:14,845 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1460047303577_0002_01_001039 Container Transitioned from ACQUIRED to RUNNING {code} This results in new applications being unable to even get an AM, and never starting > Preempted resources go back to the same application > --- > > Key: YARN-4931 > URL: https://issues.apache.org/jira/browse/YARN-4931 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: Miles Crawford > Attach
[jira] [Updated] (YARN-4931) Preempted resources go back to the same application
[ https://issues.apache.org/jira/browse/YARN-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Crawford updated YARN-4931: - Description: Sometimes a queue that needs resources causes preemption - but the preempted containers are just allocated right back to the application that just released them! Here is a tiny application (0007) that wants resources, and a container is preempted from application 0002 to satisfy it: {code} 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler (FairSchedulerUpdateThread): Should preempt res for queue root.default: resDueToMinShare = , resDueToFairShare = 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler (FairSchedulerUpdateThread): Preempting container (prio=1res=) from queue root.milesc 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics (FairSchedulerUpdateThread): Non-AM container preempted, current appAttemptId=appattempt_1460047303577_0002_01, containerId=container_1460047303577_0002_01_001038, resource= 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (FairSchedulerUpdateThread): container_1460047303577_0002_01_001038 Container Transitioned from RUNNING to KILLED {code} But then a moment later, application 2 gets the container right back: {code} 2016-04-07 21:08:13,844 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode (ResourceManager Event Processor): Assigned container container_1460047303577_0002_01_001039 of capacity on host ip-10-12-40-63.us-west-2.compute.internal:8041, which has 13 containers, used and available after allocation 2016-04-07 21:08:14,555 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (IPC Server handler 59 on 8030): container_1460047303577_0002_01_001039 Container Transitioned from ALLOCATED to ACQUIRED 
2016-04-07 21:08:14,845 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1460047303577_0002_01_001039 Container Transitioned from ACQUIRED to RUNNING {code} This results in new applications being unable to even get an AM, and never starting at all. > Preempted resources go back to the same application > --- > > Key: YARN-4931 > URL: https://issues.apache.org/jira/browse/YARN-4931 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: Miles Crawford >
[jira] [Updated] (YARN-4931) Preempted resources go back to the same application
[ https://issues.apache.org/jira/browse/YARN-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Crawford updated YARN-4931: - Attachment: resourcemanager.log Log snippet showing the behavior in detail. > Preempted resources go back to the same application > --- > > Key: YARN-4931 > URL: https://issues.apache.org/jira/browse/YARN-4931 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: Miles Crawford > Attachments: resourcemanager.log > > > Sometimes a queue that needs resources causes preemption - but the preempted > containers are just allocated right back to the application that just > released them! > Here is a tiny application (0007) that wants resources, and a container is > preempted from application 0002 to satisfy it: > {code} > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler > (FairSchedulerUpdateThread): Should preempt res for > queue root.default: resDueToMinShare = , > resDueToFairShare = > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler > (FairSchedulerUpdateThread): Preempting container (prio=1res= vCores:1>) from queue root.milesc > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics > (FairSchedulerUpdateThread): Non-AM container preempted, current > appAttemptId=appattempt_1460047303577_0002_01, > containerId=container_1460047303577_0002_01_001038, resource= vCores:1> > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (FairSchedulerUpdateThread): container_1460047303577_0002_01_001038 Container > Transitioned from RUNNING to KILLED > {/code} > But then a moment later, application 2 gets the container right back: > {code} > 2016-04-07 21:08:13,844 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode > (ResourceManager 
Event Processor): Assigned container > container_1460047303577_0002_01_001039 of capacity > on host ip-10-12-40-63.us-west-2.compute.internal:8041, which has 13 > containers, used and > available after allocation > 2016-04-07 21:08:14,555 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (IPC Server handler 59 on 8030): container_1460047303577_0002_01_001039 > Container Transitioned from ALLOCATED to ACQUIRED > 2016-04-07 21:08:14,845 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (ResourceManager Event Processor): container_1460047303577_0002_01_001039 > Container Transitioned from ACQUIRED to RUNNING > {/code} > This results in new applications being unable to even get an AM, and never > starting -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4931) Preempted resources go back to the same application
[ https://issues.apache.org/jira/browse/YARN-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Crawford updated YARN-4931: - Summary: Preempted resources go back to the same application (was: Preempted resources go back to ) > Preempted resources go back to the same application > --- > > Key: YARN-4931 > URL: https://issues.apache.org/jira/browse/YARN-4931 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.2 >Reporter: Miles Crawford > > Sometimes a queue that needs resources causes preemption - but the preempted > containers are just allocated right back to the application that just > released them! > Here is a tiny application (0007) that wants resources, and a container is > preempted from application 0002 to satisfy it: > {code} > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler > (FairSchedulerUpdateThread): Should preempt res for > queue root.default: resDueToMinShare = , > resDueToFairShare = > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler > (FairSchedulerUpdateThread): Preempting container (prio=1res= vCores:1>) from queue root.milesc > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics > (FairSchedulerUpdateThread): Non-AM container preempted, current > appAttemptId=appattempt_1460047303577_0002_01, > containerId=container_1460047303577_0002_01_001038, resource= vCores:1> > 2016-04-07 21:08:13,463 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (FairSchedulerUpdateThread): container_1460047303577_0002_01_001038 Container > Transitioned from RUNNING to KILLED > {/code} > But then a moment later, application 2 gets the container right back: > {code} > 2016-04-07 21:08:13,844 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode > (ResourceManager Event 
Processor): Assigned container > container_1460047303577_0002_01_001039 of capacity > on host ip-10-12-40-63.us-west-2.compute.internal:8041, which has 13 > containers, used and > available after allocation > 2016-04-07 21:08:14,555 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (IPC Server handler 59 on 8030): container_1460047303577_0002_01_001039 > Container Transitioned from ALLOCATED to ACQUIRED > 2016-04-07 21:08:14,845 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (ResourceManager Event Processor): container_1460047303577_0002_01_001039 > Container Transitioned from ACQUIRED to RUNNING > {/code} > This results in new applications being unable to even get an AM, and never > starting -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4931) Preempted resources go back to
Miles Crawford created YARN-4931: Summary: Preempted resources go back to Key: YARN-4931 URL: https://issues.apache.org/jira/browse/YARN-4931 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.2 Reporter: Miles Crawford Sometimes a queue that needs resources causes preemption - but the preempted containers are just allocated right back to the application that just released them! Here is a tiny application (0007) that wants resources, and a container is preempted from application 0002 to satisfy it: {code} 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler (FairSchedulerUpdateThread): Should preempt res for queue root.default: resDueToMinShare = , resDueToFairShare = 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler (FairSchedulerUpdateThread): Preempting container (prio=1res=) from queue root.milesc 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics (FairSchedulerUpdateThread): Non-AM container preempted, current appAttemptId=appattempt_1460047303577_0002_01, containerId=container_1460047303577_0002_01_001038, resource= 2016-04-07 21:08:13,463 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (FairSchedulerUpdateThread): container_1460047303577_0002_01_001038 Container Transitioned from RUNNING to KILLED {code} But then a moment later, application 2 gets the container right back: {code} 2016-04-07 21:08:13,844 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode (ResourceManager Event Processor): Assigned container container_1460047303577_0002_01_001039 of capacity on host ip-10-12-40-63.us-west-2.compute.internal:8041, which has 13 containers, used and available after allocation 2016-04-07 21:08:14,555 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (IPC Server handler 59 on 8030): 
container_1460047303577_0002_01_001039 Container Transitioned from ALLOCATED to ACQUIRED 2016-04-07 21:08:14,845 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1460047303577_0002_01_001039 Container Transitioned from ACQUIRED to RUNNING {code} This results in new applications being unable to even get an AM, and never starting -- This message was sent by Atlassian JIRA (v6.3.4#6332)
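The behavior in the logs above can be modeled with a toy allocation loop. This is an illustrative sketch, not FairScheduler code: a scheduler with no memory of which application was just preempted simply offers the freed capacity to whoever is next in scheduling order — which can be the preempted application itself. The second method shows one possible mitigation (skipping the just-preempted app for that assignment round); whether that is the right fix for YARN is an open question on this issue.

```java
import java.util.List;

// Toy model of the reallocation loop; names are illustrative only.
public class PreemptionLoopSketch {

    // A naive scheduler ignores who was just preempted and hands the
    // freed container to the first app in scheduling order.
    public static String assignFreedContainer(List<String> schedulingOrder,
                                              String preemptedApp) {
        return schedulingOrder.get(0);
    }

    // One possible mitigation: skip the just-preempted app this round so
    // the starved app (the one preemption was triggered for) gets served.
    public static String assignWithBackoff(List<String> schedulingOrder,
                                           String preemptedApp) {
        for (String app : schedulingOrder) {
            if (!app.equals(preemptedApp)) {
                return app;
            }
        }
        return preemptedApp; // only the preempted app is asking
    }
}
```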
[jira] [Updated] (YARN-4851) Metric improvements for ATS v1.5 storage
[ https://issues.apache.org/jira/browse/YARN-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-4851: Attachment: YARN-4851-trunk.001.patch First draft of the patch. In this patch I've added some metrics for the ATS v1.5 storage. Specifically: For overall system usage: - Number of read requests to the summary storage - Number of read requests to the detail storage - Number of entities scanned by the EntityGroupFS storage into the summary storage - Accumulated time spent scanning for new apps in the active directory - Accumulated time spent reading summary data into the summary storage Caching performance: - Number of cache storage refreshes (cache reloads). This can be compared to the number of read requests to the detail storage to understand how useful the caching layer is for a specific cluster workload. - Accumulated time spent refreshing cache storages. Log cleaner/purging: - Number of dirs purged by the storage - Accumulated time spent on log cleaning. > Metric improvements for ATS v1.5 storage > > > Key: YARN-4851 > URL: https://issues.apache.org/jira/browse/YARN-4851 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-4851-trunk.001.patch > > > We can add more metrics to the ATS v1.5 storage systems, including purging, > cache hit/misses, read latency, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
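The relationship between the proposed counters can be shown with a minimal sketch. This uses plain fields instead of Hadoop's metrics2 classes, and all names (EntityGroupFSMetricsSketch, onDetailRead, etc.) are illustrative, not the ones in the attached patch.

```java
// Plain-Java stand-in for a few of the metrics listed above; a real
// implementation would use Hadoop metrics2 counters and timers instead.
public class EntityGroupFSMetricsSketch {
    private long summaryReads;
    private long detailReads;
    private long cacheRefreshes;
    private long cacheRefreshTimeMs;

    public void onSummaryRead()  { summaryReads++; }
    public void onDetailRead()   { detailReads++; }

    public void onCacheRefresh(long elapsedMs) {
        cacheRefreshes++;
        cacheRefreshTimeMs += elapsedMs;  // accumulated refresh time
    }

    // Comparing refreshes to detail-storage reads is the signal described
    // above: a low ratio means the cache absorbed most detail reads.
    public double refreshToDetailReadRatio() {
        return detailReads == 0 ? 0.0 : (double) cacheRefreshes / detailReads;
    }

    public long getSummaryReads() { return summaryReads; }
    public long getCacheRefreshTimeMs() { return cacheRefreshTimeMs; }
}
```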
[jira] [Updated] (YARN-4733) [YARN-3368] Initial commit of new YARN web UI
[ https://issues.apache.org/jira/browse/YARN-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4733: - Summary: [YARN-3368] Initial commit of new YARN web UI (was: [YARN-3368] Commit initial web UI patch to branch: YARN-3368) > [YARN-3368] Initial commit of new YARN web UI > - > > Key: YARN-4733 > URL: https://issues.apache.org/jira/browse/YARN-4733 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: YARN-3368 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4514) [YARN-3368] Cleanup hardcoded configurations, such as RM/ATS addresses
[ https://issues.apache.org/jira/browse/YARN-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231152#comment-15231152 ] Hadoop QA commented on YARN-4514: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 5s {color} | {color:red} YARN-4514 does not apply to YARN-3368. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12797310/YARN-4514-YARN-3368.4.patch | | JIRA Issue | YARN-4514 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10987/console | | Powered by | Apache Yetus 0.2.0 http://yetus.apache.org | This message was automatically generated. > [YARN-3368] Cleanup hardcoded configurations, such as RM/ATS addresses > -- > > Key: YARN-4514 > URL: https://issues.apache.org/jira/browse/YARN-4514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Sunil G > Attachments: YARN-4514-YARN-3368.1.patch, > YARN-4514-YARN-3368.2.patch, YARN-4514-YARN-3368.3.patch, > YARN-4514-YARN-3368.4.patch > > > We have several configurations are hard-coded, for example, RM/ATS addresses, > we should make them configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-4514) [YARN-3368] Cleanup hardcoded configurations, such as RM/ATS addresses
[ https://issues.apache.org/jira/browse/YARN-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reopened YARN-4514: -- > [YARN-3368] Cleanup hardcoded configurations, such as RM/ATS addresses > -- > > Key: YARN-4514 > URL: https://issues.apache.org/jira/browse/YARN-4514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Sunil G > Attachments: YARN-4514-YARN-3368.1.patch, > YARN-4514-YARN-3368.2.patch, YARN-4514-YARN-3368.3.patch, > YARN-4514-YARN-3368.4.patch > > > We have several configurations are hard-coded, for example, RM/ATS addresses, > we should make them configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4514) [YARN-3368] Cleanup hardcoded configurations, such as RM/ATS addresses
[ https://issues.apache.org/jira/browse/YARN-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-4514. -- Resolution: Fixed Have to resolve and reopen to set status to be patch available. > [YARN-3368] Cleanup hardcoded configurations, such as RM/ATS addresses > -- > > Key: YARN-4514 > URL: https://issues.apache.org/jira/browse/YARN-4514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Sunil G > Attachments: YARN-4514-YARN-3368.1.patch, > YARN-4514-YARN-3368.2.patch, YARN-4514-YARN-3368.3.patch, > YARN-4514-YARN-3368.4.patch > > > We have several configurations are hard-coded, for example, RM/ATS addresses, > we should make them configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and fix licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4849: - Summary: [YARN-3368] cleanup code base, integrate web UI related build to mvn, and fix licenses. (was: [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.) > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and fix > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4849-YARN-3368.1.patch, > YARN-4849-YARN-3368.2.patch, YARN-4849-YARN-3368.3.patch, > YARN-4849-YARN-3368.4.patch, YARN-4849-YARN-3368.5.patch, > YARN-4849-YARN-3368.6.patch, YARN-4849-YARN-3368.7.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231135#comment-15231135 ] Wangda Tan commented on YARN-4849: -- Thanks for review, [~sunilg]. ASF license issue is not caused by the patch. Committing to branch-3368. > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4849-YARN-3368.1.patch, > YARN-4849-YARN-3368.2.patch, YARN-4849-YARN-3368.3.patch, > YARN-4849-YARN-3368.4.patch, YARN-4849-YARN-3368.5.patch, > YARN-4849-YARN-3368.6.patch, YARN-4849-YARN-3368.7.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4927) TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the default
[ https://issues.apache.org/jira/browse/YARN-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231137#comment-15231137 ] Hadoop QA commented on YARN-4927: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 79m 59s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_77. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 49s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 155m 19s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_77 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.webapp.TestRMWithCSRFFilter | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | | JDK v1.8.0_77 Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.y
[jira] [Commented] (YARN-4865) Track Reserved resources in ResourceUsage and QueueCapacities
[ https://issues.apache.org/jira/browse/YARN-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231126#comment-15231126 ] Wangda Tan commented on YARN-4865: -- [~sunilg], It seems this patch needs one more fix: {code} diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java index 9a74c22..df57787 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java @@ -1322,14 +1322,6 @@ public void completedContainer(Resource clusterResource, // Book-keeping if (removed) { - - // track reserved resource for metrics, for normal container - // getReservedResource will be null. 
- Resource reservedRes = rmContainer.getReservedResource(); - if (reservedRes != null && !reservedRes.equals(Resources.none())) { -decReservedResource(node.getPartition(), reservedRes); - } - // Inform the ordering policy orderingPolicy.containerReleased(application, rmContainer); diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java index cf1b3e0..558fc53 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java @@ -247,6 +247,8 @@ public synchronized boolean unreserve(Priority priority, // Update reserved metrics queue.getMetrics().unreserveResource(getUser(), rmContainer.getReservedResource()); + + queue.decReservedResource(node.getPartition(), rmContainer.getReservedResource()); return true; } return false; {code} We need the above change to make sure an allocation from a reserved container correctly deducts the reserved resource. [~sunilg], could you add a few tests as well? Some other cases come to mind that we need to consider: - Nodes lost / disconnected: we need to deduct reserved resources on such nodes. (I think it should be covered by the completedContainer code path.) The above can be addressed in a separate JIRA. 
> Track Reserved resources in ResourceUsage and QueueCapacities > -- > > Key: YARN-4865 > URL: https://issues.apache.org/jira/browse/YARN-4865 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Sunil G >Assignee: Sunil G > Fix For: 2.9.0 > > Attachments: 0001-YARN-4865.patch, 0002-YARN-4865.patch, > 0003-YARN-4865.patch > > > As discussed in YARN-4678, capture reserved capacity separately in > QueueCapcities for better tracking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
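The change discussed in the comment above restores symmetric bookkeeping: whatever the reservation path adds must be deducted both when a reservation is cancelled and when the reserved container is finally allocated. A toy sketch of that invariant (class and method names here are illustrative; the real accounting lives in ResourceUsage/QueueCapacities):

```java
import java.util.HashMap;
import java.util.Map;

public class ReservedResourceTracker {
    // Reserved memory (MB) per node partition; a stand-in for the per-queue
    // reserved-capacity tracking in QueueCapacities.
    private final Map<String, Long> reservedMb = new HashMap<>();

    void incReservedResource(String partition, long mb) {
        reservedMb.merge(partition, mb, Long::sum);
    }

    // Must be called both when a reservation is cancelled and when a
    // reserved container is converted into an allocation, or the counter
    // leaks (the bug the patch above fixes).
    void decReservedResource(String partition, long mb) {
        reservedMb.merge(partition, -mb, Long::sum);
    }

    long getReservedMb(String partition) {
        return reservedMb.getOrDefault(partition, 0L);
    }

    public static void main(String[] args) {
        ReservedResourceTracker t = new ReservedResourceTracker();
        t.incReservedResource("", 4096);   // container reserved on default partition
        t.decReservedResource("", 4096);   // reservation converted to allocation
        System.out.println(t.getReservedMb(""));   // 0
    }
}
```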
[jira] [Commented] (YARN-4781) Support intra-queue preemption for fairness ordering policy.
[ https://issues.apache.org/jira/browse/YARN-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231088#comment-15231088 ] Wangda Tan commented on YARN-4781: -- [~sunilg], We should make sure this JIRA contains infrastructure that can be used by other intra-queue preemption policies, such as priority-based. [~milesc], Thanks for sharing your use cases; hopefully you won't need to create one queue for each job any more after this feature :) > Support intra-queue preemption for fairness ordering policy. > > > Key: YARN-4781 > URL: https://issues.apache.org/jira/browse/YARN-4781 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > > We introduced the fairness queue policy in YARN-3319, which lets large > applications make progress without starving small applications. However, if a > large application takes the queue’s resources and its containers have long > lifespans, small applications could still wait a long time for resources and > SLAs cannot be guaranteed. > Instead of waiting for applications to release resources on their own, we need > to preempt resources in queues with the fairness policy enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
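For intuition on what a fairness-based intra-queue policy computes, here is a grossly simplified sketch that splits a queue's usage into equal fair shares and marks anything above the share as preemptable. Real policies also weigh demand, priorities, and container lifespans; all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class FairnessPreemptionSketch {
    // Given per-app usage (MB) in one queue, compute how much to preempt
    // from each app so every app could reach an equal fair share.
    static Map<String, Long> toPreempt(Map<String, Long> usageMb) {
        long total = usageMb.values().stream().mapToLong(Long::longValue).sum();
        long fairShare = total / usageMb.size();
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : usageMb.entrySet()) {
            // Only apps above their fair share contribute preemption candidates.
            result.put(e.getKey(), Math.max(0, e.getValue() - fairShare));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> usage = new LinkedHashMap<>();
        usage.put("bigApp", 9000L);    // long-running app holding the queue
        usage.put("smallApp", 0L);     // starving
        usage.put("otherApp", 0L);     // starving
        // bigApp exceeds its 3000 MB fair share by 6000 MB.
        System.out.println(toPreempt(usage));
    }
}
```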
[jira] [Commented] (YARN-4552) NM ResourceLocalizationService should check and initialize local filecache dir (and log dir) even if NM recover is enabled.
[ https://issues.apache.org/jira/browse/YARN-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230992#comment-15230992 ] Junping Du commented on YARN-4552: -- Hi [~vinodkv], I cannot reproduce this issue on the current 2.7 branch. I will double-check whether I missed something in the reproduction process. Let's remove the target version but keep it open for more investigation later. > NM ResourceLocalizationService should check and initialize local filecache > dir (and log dir) even if NM recover is enabled. > --- > > Key: YARN-4552 > URL: https://issues.apache.org/jira/browse/YARN-4552 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4552-v2.patch, YARN-4552.patch > > > In some cases, users clean up the localized file cache for debugging/trouble > shooting purposes during NM downtime. However, after bringing the NM back (with > recovery enabled), job submission can fail with an exception like the one > below: > {noformat} > Diagnostics: java.io.FileNotFoundException: File > /disk/12/yarn/local/filecache does not exist. > {noformat} > This is because we only create the filecache dir when recovery is not enabled > while ResourceLocalizationService gets initialized/started. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
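The gist of the proposed fix is to make local-dir initialization unconditional on the recovery flag, since recovered state can claim a directory exists that an admin wiped while the NM was down. A hedged sketch of that check (not the actual ResourceLocalizationService code):

```java
import java.io.File;
import java.io.IOException;

public class LocalDirInit {
    // Ensure a local dir (e.g. the filecache dir) exists, regardless of
    // whether NM recovery is enabled. The bug described above skipped this
    // step when recovery was on, so a manually wiped dir was never recreated.
    static void ensureLocalDir(File dir) throws IOException {
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IOException("Cannot create local dir " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        File filecache = new File(System.getProperty("java.io.tmpdir"),
                "filecache-demo");
        ensureLocalDir(filecache);   // run on every startup, recovery or not
        System.out.println(filecache.exists());   // true
    }
}
```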
[jira] [Updated] (YARN-4552) NM ResourceLocalizationService should check and initialize local filecache dir (and log dir) even if NM recover is enabled.
[ https://issues.apache.org/jira/browse/YARN-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4552: - Target Version/s: (was: 2.8.0, 2.7.3, 2.6.5) > NM ResourceLocalizationService should check and initialize local filecache > dir (and log dir) even if NM recover is enabled. > --- > > Key: YARN-4552 > URL: https://issues.apache.org/jira/browse/YARN-4552 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4552-v2.patch, YARN-4552.patch > > > In some cases, users clean up the localized file cache for debugging/trouble > shooting purposes during NM downtime. However, after bringing the NM back (with > recovery enabled), job submission can fail with an exception like the one > below: > {noformat} > Diagnostics: java.io.FileNotFoundException: File > /disk/12/yarn/local/filecache does not exist. > {noformat} > This is because we only create the filecache dir when recovery is not enabled > while ResourceLocalizationService gets initialized/started. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4927) TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the default
[ https://issues.apache.org/jira/browse/YARN-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4927: --- Assignee: Bibin A Chundatt (was: Karthik Kambatla) > TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the > default > > > Key: YARN-4927 > URL: https://issues.apache.org/jira/browse/YARN-4927 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4927.patch > > > YARN-3893 adds this test, that relies on some CapacityScheduler-specific > stuff for refreshAll to fail, which doesn't apply when using FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4821) Have a separate NM timeline publishing-interval
[ https://issues.apache.org/jira/browse/YARN-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230903#comment-15230903 ] Naganarasimha G R commented on YARN-4821: - Thanks for the comments [~vinodkv], bq. We should completely decouple these two. If the publishing-interval is configured to be not a multiple of the monitoring-interval, the publisher could only look at the last N values from the monitor before the last cycle. As we discussed in the meeting, IMHO it is much simpler for the user to configure just a multiple of the monitoring interval after which the ATS event for the resource usage will be published. Otherwise, the user needs to be made aware of the relation between the publishing interval and the monitoring interval. So it would be something like *monitoring interval = 3 seconds, publish frequency = 5*; then every 3*5 = 15 seconds, the average of 5 values will be published. Maybe I can come up with a WIP patch based on this and discuss whether it is fine. Will go through YARN-3332 before working on the patch. > Have a separate NM timeline publishing-interval > --- > > Key: YARN-4821 > URL: https://issues.apache.org/jira/browse/YARN-4821 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > Currently the interval with which NM publishes container CPU and memory > metrics is tied to {{yarn.nodemanager.resource-monitor.interval-ms}} whose > default is 3 seconds. This is too aggressive. > There should be a separate configuration that controls how often > {{NMTimelinePublisher}} publishes container metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
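The proposal above (publish every N monitoring ticks, averaging the N buffered samples) can be sketched as follows; this is illustrative, not the actual NMTimelinePublisher code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AveragingPublisher {
    private final int publishFrequency;          // publish every N monitor ticks
    private final Deque<Long> window = new ArrayDeque<>();

    AveragingPublisher(int publishFrequency) {
        this.publishFrequency = publishFrequency;
    }

    // Called once per monitoring interval with the latest sample. Returns the
    // averaged value to publish every publishFrequency ticks, or null on the
    // ticks in between (nothing is published then).
    Long onSample(long memoryMb) {
        window.addLast(memoryMb);
        if (window.size() < publishFrequency) {
            return null;
        }
        long sum = 0;
        for (long v : window) {
            sum += v;
        }
        window.clear();
        return sum / publishFrequency;
    }

    public static void main(String[] args) {
        // monitoring interval = 3 s, publish frequency = 5 -> publish every 15 s
        AveragingPublisher p = new AveragingPublisher(5);
        Long out = null;
        for (long sample : new long[] {100, 200, 300, 400, 500}) {
            out = p.onSample(sample);
        }
        System.out.println(out);   // 300, the average of the last 5 samples
    }
}
```

With this shape the user configures only the multiple; the alternative Vinod suggests (a fully independent publishing interval) would instead average however many samples arrived since the last publish.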
[jira] [Commented] (YARN-4821) Have a separate NM timeline publishing-interval
[ https://issues.apache.org/jira/browse/YARN-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230813#comment-15230813 ] Vinod Kumar Vavilapalli commented on YARN-4821: --- bq. This proposal is simply to use a different publishing interval just for the timeline publishing +1. We should completely decouple these two. If the publishing-interval is configured to be not a multiple of the monitoring-interval, the publisher could only look at the last N values from the monitor before the last cycle. Can you also please have a read at YARN-3332 and see if you can organize the code in a somewhat independent way? A related data point for deciding the interval itself - the Hadoop Metrics plugin pulls metrics from all of our daemons and pushes them out periodically - with a default value of 10 sec IIRC. This is the periodicity for most production clusters. Assuming adding container-metrics data to this still keeps the total outgoing data at the same or an immediately adjacent order of magnitude (say 250 metrics per NM + (50 containers * 50 metrics)), we should be okay with the same frequency. Anything more frequent will need careful benchmarking. > Have a separate NM timeline publishing-interval > --- > > Key: YARN-4821 > URL: https://issues.apache.org/jira/browse/YARN-4821 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > Currently the interval with which NM publishes container CPU and memory > metrics is tied to {{yarn.nodemanager.resource-monitor.interval-ms}} whose > default is 3 seconds. This is too aggressive. > There should be a separate configuration that controls how often > {{NMTimelinePublisher}} publishes container metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4821) Have a separate NM timeline publishing-interval
[ https://issues.apache.org/jira/browse/YARN-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-4821: -- Summary: Have a separate NM timeline publishing-interval (was: have a separate NM timeline publishing interval) > Have a separate NM timeline publishing-interval > --- > > Key: YARN-4821 > URL: https://issues.apache.org/jira/browse/YARN-4821 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > Currently the interval with which NM publishes container CPU and memory > metrics is tied to {{yarn.nodemanager.resource-monitor.interval-ms}} whose > default is 3 seconds. This is too aggressive. > There should be a separate configuration that controls how often > {{NMTimelinePublisher}} publishes container metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3461) Consolidate flow name/version/run defaults
[ https://issues.apache.org/jira/browse/YARN-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230620#comment-15230620 ] Sangjin Lee commented on YARN-3461: --- Thanks folks! > Consolidate flow name/version/run defaults > -- > > Key: YARN-3461 > URL: https://issues.apache.org/jira/browse/YARN-3461 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Sangjin Lee > Labels: yarn-2928-1st-milestone > Fix For: YARN-2928 > > Attachments: YARN-3461-YARN-2928.01.patch, > YARN-3461-YARN-2928.02.patch, YARN-3461-YARN-2928.03.patch > > > In YARN-3391, it's not resolved what should be the defaults for flow > name/version/run. Let's continue the discussion here and unblock YARN-3391 > from moving forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3461) Consolidate flow name/version/run defaults
[ https://issues.apache.org/jira/browse/YARN-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230556#comment-15230556 ] Varun Saxena commented on YARN-3461: I have committed this to YARN-2928 branch. Thanks [~sjlee0] for your contribution and thanks [~Naganarasimha] and [~gtCarrera9] for additional reviews. > Consolidate flow name/version/run defaults > -- > > Key: YARN-3461 > URL: https://issues.apache.org/jira/browse/YARN-3461 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Sangjin Lee > Labels: yarn-2928-1st-milestone > Fix For: YARN-2928 > > Attachments: YARN-3461-YARN-2928.01.patch, > YARN-3461-YARN-2928.02.patch, YARN-3461-YARN-2928.03.patch > > > In YARN-3391, it's not resolved what should be the defaults for flow > name/version/run. Let's continue the discussion here and unblock YARN-3391 > from moving forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4927) TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the default
[ https://issues.apache.org/jira/browse/YARN-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230544#comment-15230544 ] Bibin A Chundatt commented on YARN-4927: [~kasha] Apologies for not considering the case where the default scheduler could be the FairScheduler. Attaching a patch to handle the same. > TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the > default > > > Key: YARN-4927 > URL: https://issues.apache.org/jira/browse/YARN-4927 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: 0001-YARN-4927.patch > > > YARN-3893 adds this test, which relies on some CapacityScheduler-specific > stuff for refreshAll to fail, which doesn't apply when using FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4927) TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the default
[ https://issues.apache.org/jira/browse/YARN-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4927: --- Attachment: 0001-YARN-4927.patch > TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the > default > > > Key: YARN-4927 > URL: https://issues.apache.org/jira/browse/YARN-4927 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: 0001-YARN-4927.patch > > > YARN-3893 adds this test, that relies on some CapacityScheduler-specific > stuff for refreshAll to fail, which doesn't apply when using FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4929) Fix unit test case failures because of removing the minimum wait time for attempt.
[ https://issues.apache.org/jira/browse/YARN-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-4929: --- Description: The following unit test cases fail because we removed the minimum wait time for attempts in YARN-4807 - TestAMRestart.testRMAppAttemptFailuresValidityInterval - TestApplicationMasterService.testResourceTypes - TestContainerResourceUsage.testUsageAfterAMRestartWithMultipleContainers - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForFairSche - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForCapacitySche was: The following unit test cases failed because of we remove the minimum wait time for attempt. - TestAMRestart.testRMAppAttemptFailuresValidityInterval - TestApplicationMasterService.testResourceTypes - TestContainerResourceUsage.testUsageAfterAMRestartWithMultipleContainers - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForFairSche - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForCapacitySche > Fix unit test case failures because of removing the minimum wait time for > attempt. > -- > > Key: YARN-4929 > URL: https://issues.apache.org/jira/browse/YARN-4929 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yufei Gu >Assignee: Yufei Gu > > The following unit test cases fail because we removed the minimum wait > time for attempts in YARN-4807 > - TestAMRestart.testRMAppAttemptFailuresValidityInterval > - TestApplicationMasterService.testResourceTypes > - TestContainerResourceUsage.testUsageAfterAMRestartWithMultipleContainers > - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForFairSche > - TestRMApplicationHistoryWriter.testRMWritingMassiveHistoryForCapacitySche -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230523#comment-15230523 ] Jonathan Maron commented on YARN-4757: -- Sounds like a good approach to me. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3461) Consolidate flow name/version/run defaults
[ https://issues.apache.org/jira/browse/YARN-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230522#comment-15230522 ] Varun Saxena commented on YARN-3461: Latest patch LGTM. Will commit it shortly. > Consolidate flow name/version/run defaults > -- > > Key: YARN-3461 > URL: https://issues.apache.org/jira/browse/YARN-3461 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Sangjin Lee > Labels: yarn-2928-1st-milestone > Attachments: YARN-3461-YARN-2928.01.patch, > YARN-3461-YARN-2928.02.patch, YARN-3461-YARN-2928.03.patch > > > In YARN-3391, it's not resolved what should be the defaults for flow > name/version/run. Let's continue the discussion here and unblock YARN-3391 > from moving forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230491#comment-15230491 ] Varun Vasudev commented on YARN-4757: - [~jmaron] - given the feedback and the scope of the changes involved here, I think we should just develop this in a branch and file sub tasks to ensure we address concerns like the ones Allen has raised. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice, this makes it > very difficult for wiring up existing clients and services. For e.g, dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code-changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services. > YARN-913 in fact talked about such a DNS based mechanism but left it as a > future task. (Task) Having the registry information exposed via DNS > simplifies the life of services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-4736: -- Labels: (was: yarn-2928-1st-milestone) > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Attachments: NM_Hang_hbase1.0.3.tar.gz, hbaseException.log, > threaddump.log > > > Faced some issues while running ATSv2 in single node Hadoop cluster and in > the same node had launched Hbase with embedded zookeeper. > # Due to some NPE issues i was able to see NM was trying to shutdown, but the > NM daemon process was not completed due to the locks. > # Got some exception related to Hbase after application finished execution > successfully. > will attach logs and the trace for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4928) Some yarn.server.timeline.* tests fail on Windows attempting to use a test root path containing a colon
[ https://issues.apache.org/jira/browse/YARN-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230433#comment-15230433 ] Hadoop QA commented on YARN-4928: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 31s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage in trunk has 1 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 11s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 10s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 9s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 11s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 51s {color} | {color:green} hadoop-yarn-server-timeline-pluginstorage in the patch passed with JDK v1.8.0_77. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 53s {color} | {color:green} hadoop-yarn-server-timeline-pluginstorage in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 13m 56s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12797535/YARN-4928.003.patch | | JIRA Issue | YARN-4928 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 16095ff576a1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/ha
[jira] [Commented] (YARN-4928) Some yarn.server.timeline.* tests fail on Windows attempting to use a test root path containing a colon
[ https://issues.apache.org/jira/browse/YARN-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230422#comment-15230422 ] Junping Du commented on YARN-4928: -- The 003 patch looks good in general. A nit: we generally don't use the wildcard import pattern below. {noformat} +import org.apache.hadoop.fs.*; {noformat} CC [~gtCarrera9], who is the author of the related test case. > Some yarn.server.timeline.* tests fail on Windows attempting to use a test > root path containing a colon > --- > > Key: YARN-4928 > URL: https://issues.apache.org/jira/browse/YARN-4928 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 > Environment: OS: Windows Server 2012 > JDK: 1.7.0_79 >Reporter: Gergely Novák >Assignee: Gergely Novák >Priority: Minor > Attachments: YARN-4928.001.patch, YARN-4928.002.patch, > YARN-4928.003.patch > > > yarn.server.timeline.TestEntityGroupFSTimelineStore.* and > yarn.server.timeline.TestLogInfo.* fail on Windows, because they are > attempting to use a test root paths like > "/C:/hdp/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/target/test-dir/TestLogInfo", > which contains a ":" (after the Windows drive letter) and > DFSUtil.isValidName() does not accept paths containing ":". > This problem is identical to HDFS-6189, so I suggest to use the same > approach: using "/tmp/..." as test root dir instead of > System.getProperty("test.build.data", System.getProperty("java.io.tmpdir")). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4928) Some yarn.server.timeline.* tests fail on Windows attempting to use a test root path containing a colon
[ https://issues.apache.org/jira/browse/YARN-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230408#comment-15230408 ] Gergely Novák commented on YARN-4928: - Sorry, I used an older (incompatible) branch for the patch; v003 now works for branch-2.8 and trunk. > Some yarn.server.timeline.* tests fail on Windows attempting to use a test > root path containing a colon > --- > > Key: YARN-4928 > URL: https://issues.apache.org/jira/browse/YARN-4928 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 > Environment: OS: Windows Server 2012 > JDK: 1.7.0_79 >Reporter: Gergely Novák >Assignee: Gergely Novák >Priority: Minor > Attachments: YARN-4928.001.patch, YARN-4928.002.patch, > YARN-4928.003.patch > > > yarn.server.timeline.TestEntityGroupFSTimelineStore.* and > yarn.server.timeline.TestLogInfo.* fail on Windows, because they are > attempting to use a test root paths like > "/C:/hdp/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/target/test-dir/TestLogInfo", > which contains a ":" (after the Windows drive letter) and > DFSUtil.isValidName() does not accept paths containing ":". > This problem is identical to HDFS-6189, so I suggest to use the same > approach: using "/tmp/..." as test root dir instead of > System.getProperty("test.build.data", System.getProperty("java.io.tmpdir")). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4928) Some yarn.server.timeline.* tests fail on Windows attempting to use a test root path containing a colon
[ https://issues.apache.org/jira/browse/YARN-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gergely Novák updated YARN-4928: Attachment: YARN-4928.003.patch > Some yarn.server.timeline.* tests fail on Windows attempting to use a test > root path containing a colon > --- > > Key: YARN-4928 > URL: https://issues.apache.org/jira/browse/YARN-4928 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 > Environment: OS: Windows Server 2012 > JDK: 1.7.0_79 >Reporter: Gergely Novák >Assignee: Gergely Novák >Priority: Minor > Attachments: YARN-4928.001.patch, YARN-4928.002.patch, > YARN-4928.003.patch > > > yarn.server.timeline.TestEntityGroupFSTimelineStore.* and > yarn.server.timeline.TestLogInfo.* fail on Windows, because they are > attempting to use a test root paths like > "/C:/hdp/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/target/test-dir/TestLogInfo", > which contains a ":" (after the Windows drive letter) and > DFSUtil.isValidName() does not accept paths containing ":". > This problem is identical to HDFS-6189, so I suggest to use the same > approach: using "/tmp/..." as test root dir instead of > System.getProperty("test.build.data", System.getProperty("java.io.tmpdir")). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
[ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230349#comment-15230349 ] Varun Vasudev commented on YARN-4876: - Thanks for the document [~asuresh]! Here are my initial thoughts - {code} Add int field 'destroyDelay' to each 'StartContainerRequest':{code} I think we should avoid this for now - we should require that AMs that use initialize() must call destroy and AMs that call start with the ContainerLaunchContext can't call destroy. We can achieve that by adding the destroyDelay field you mentioned in your document but don't allow AMs to set it. If initialize is called, set destroyDelay internally to \-1, else to 0. I'm not saying we should drop the feature, just that we should come back to it once we've sorted out the lifecycle from an initialize->destroy perspective. {code} Modify 'StopContainerRequest' Record: Add boolean 'destroyContainer': {code} Similar to above - let's avoid mixing initialize/destroy with start/stop for now. {code} • Introduce a new 'ContainerEventType.START_CONTAINER' event type. • Introduce a new 'ContainerEventType.DESTROY_CONTAINER' event type. • The Container remains in the LOCALIZED state until it receives the 'START_CONTAINER' event. {code} Can you add a state machine transition diagram to explain how new states and events affect each other? {code} If 'initializeContainer' with a new ContainerLaunchContext is called by the AM while the Container is RUNNING, It is treated as a KILL_CONTAINER event followed by a CONTAINER_RESOURCE_CLEANUP and an INIT_CONTAINER event to kick of re-localization after which the Container will return to LOCALIZED state. {code} I'd really like to avoid this specific behavior. I think we should add an explicit re-initialize API. For a running process, ideally, we want to localize the upgraded bits while the container is running and then kill the existing process to minimize the downtime. 
For containers where localization can take a long time, forcing a kill and then a re-initialize adds quite a serious amount of downtime. Re-initialize and initialize will probably end up having differing behaviors. On a similar note, I think we might have to introduce a new "re-initializing/re-localizing/running-localizing state" which implies that a container is running but we are carrying out some background work. In addition, I don't think we can do a cleanup of resources during an upgrade. For services that have local state in the container work dir, we're essentially wiping away all the local state and forcing them to start from scratch. Just a clarification: when you mentioned CONTAINER_RESOURCE_CLEANUP, I'm assuming you meant CLEANUP_CONTAINER_RESOURCES. {code} • If 'intializeContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it is considered a restart, and will follow the same code path as 'initializeContainer' with new ContainerLaunchContext, but will not perform a CONTAINER_RESOURCE_CLEANUP and INIT_CONTAINER. The Container process will be killed and the container will be returned to LOCALIZED state. • If 'startContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it treated exactly as the above case, but it will also trigger a START_CONTAINER event. {code} Instead of forcing AMs to make two calls, why don't we just add a restart API that does everything you've outlined above? It's cleaner and we don't have to do as many condition checks. In addition, with a restart API we can do things like allowing AMs to specify a delay, or conditions for when the restart should happen. 
> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop > -- > > Key: YARN-4876 > URL: https://issues.apache.org/jira/browse/YARN-4876 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun Suresh >Assignee: Arun Suresh > Attachments: YARN-4876-design-doc.pdf > > > Introduce *initialize* and *destroy* container API into the > *ContainerManagementProtocol* and decouple the actual start of a container > from the initialization. This will allow AMs to re-start a container without > having to lose the allocation. > Additionally, if the localization of the container is associated to the > initialize (and the cleanup with the destroy), This can also be used by > applications to upgrade a Container by *re-initializing* with a new > *ContainerLaunchContext* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
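The decoupled lifecycle discussed in this thread can be sketched as a toy state machine. State and event names are taken from the design-doc excerpts above, but this is illustrative only and is not NodeManager code:

```java
// Toy state machine: init/localize, start, kill, and destroy as separate
// transitions, so an AM can restart a container without losing the allocation.
enum ContainerState { NEW, LOCALIZED, RUNNING, DESTROYED }

enum ContainerEvent { INIT_CONTAINER, START_CONTAINER, KILL_CONTAINER, DESTROY_CONTAINER }

class ContainerLifecycle {
    ContainerState state = ContainerState.NEW;

    ContainerState handle(ContainerEvent e) {
        switch (e) {
            case INIT_CONTAINER:
                // Localization happens here; the container then waits in
                // LOCALIZED until it receives START_CONTAINER.
                if (state == ContainerState.NEW) state = ContainerState.LOCALIZED;
                break;
            case START_CONTAINER:
                // Start is decoupled from init.
                if (state == ContainerState.LOCALIZED) state = ContainerState.RUNNING;
                break;
            case KILL_CONTAINER:
                // Killing without destroy keeps localized resources, so the
                // AM can start the container again without re-localizing.
                if (state == ContainerState.RUNNING) state = ContainerState.LOCALIZED;
                break;
            case DESTROY_CONTAINER:
                // Destroy cleans up localized resources and ends the lifecycle.
                state = ContainerState.DESTROYED;
                break;
        }
        return state;
    }
}
```

Under this sketch, the restart API suggested in the comment would amount to KILL_CONTAINER followed by START_CONTAINER in a single call.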
[jira] [Comment Edited] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
[ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230349#comment-15230349 ] Varun Vasudev edited comment on YARN-4876 at 4/7/16 2:52 PM: - Thanks for the document [~asuresh]! Here are my initial thoughts - {code} Add int field 'destroyDelay' to each 'StartContainerRequest':{code} I think we should avoid this for now - we should require that AMs that use initialize() must call destroy and AMs that call start with the ContainerLaunchContext can't call destroy. We can achieve that by adding the destroyDelay field you mentioned in your document but don't allow AMs to set it. If initialize is called, set destroyDelay internally to \-1, else to 0. I'm not saying we should drop the feature, just that we should come back to it once we've sorted out the lifecycle from an initialize->destroy perspective. {code} Modify 'StopContainerRequest' Record: Add boolean 'destroyContainer': {code} Similar to above - let's avoid mixing initialize/destroy with start/stop for now. {code} • Introduce a new 'ContainerEventType.START_CONTAINER' event type. • Introduce a new 'ContainerEventType.DESTROY_CONTAINER' event type. • The Container remains in the LOCALIZED state until it receives the 'START_CONTAINER' event. {code} Can you add a state machine transition diagram to explain how new states and events affect each other? {code} If 'initializeContainer' with a new ContainerLaunchContext is called by the AM while the Container is RUNNING, It is treated as a KILL_CONTAINER event followed by a CONTAINER_RESOURCE_CLEANUP and an INIT_CONTAINER event to kick of re-localization after which the Container will return to LOCALIZED state. {code} I'd really like to avoid this specific behavior. I think we should add an explicit re-initialize/re-localize API. 
For a running process, ideally, we want to localize the upgraded bits while the container is running and then kill the existing process to minimize the downtime. For containers where localization can take a long time, forcing a kill and then a re-initialize adds quite a serious amount of downtime. Re-initialize and initialize will probably end up having differing behaviors. On a similar note, I think we might have to introduce a new "re-initializing/re-localizing/running-localizing state" which implies that a container is running but we are carrying out some background work. In addition, I don't think we can do a cleanup of resources during an upgrade. For services that have local state in the container work dir, we're essentially wiping away all the local state and forcing them to start from scratch. Just a clarification: when you mentioned CONTAINER_RESOURCE_CLEANUP, I'm assuming you meant CLEANUP_CONTAINER_RESOURCES. {code} • If 'intializeContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it is considered a restart, and will follow the same code path as 'initializeContainer' with new ContainerLaunchContext, but will not perform a CONTAINER_RESOURCE_CLEANUP and INIT_CONTAINER. The Container process will be killed and the container will be returned to LOCALIZED state. • If 'startContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it treated exactly as the above case, but it will also trigger a START_CONTAINER event. {code} Instead of forcing AMs to make two calls, why don't we just add a restart API that does everything you've outlined above? It's cleaner and we don't have to do as many condition checks. In addition, with a restart API we can do things like allowing AMs to specify a delay, or conditions for when the restart should happen. was (Author: vvasudev): Thanks for the document [~asuresh]! 
Here are my initial thoughts - {code} Add int field 'destroyDelay' to each 'StartContainerRequest':{code} I think we should avoid this for now - we should require that AMs that use initialize() must call destroy and AMs that call start with the ContainerLaunchContext can't call destroy. We can achieve that by adding the destroyDelay field you mentioned in your document but don't allow AMs to set it. If initialize is called, set destroyDelay internally to \-1, else to 0. I'm not saying we should drop the feature, just that we should come back to it once we've sorted out the lifecycle from an initialize->destroy perspective. {code} Modify 'StopContainerRequest' Record: Add boolean 'destroyContainer': {code} Similar to above - let's avoid mixing initialize/destroy with start/stop for now. {code} • Introduce a new 'ContainerEventType.START_CONTAINER' event type. • Introduce a new 'ContainerEventType.DESTROY_CONTAINER' event type. • The Container remains in the LOCALIZED state until it receives the 'START_CONTAINER' event. {code} Can you add a state machine transition diagram to explain how new states and events affect each other? {code} If 'initi
[jira] [Commented] (YARN-4928) Some yarn.server.timeline.* tests fail on Windows attempting to use a test root path containing a colon
[ https://issues.apache.org/jira/browse/YARN-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230307#comment-15230307 ] Gergely Novák commented on YARN-4928: - Uploaded a new patch (v.002) that seems to work as originally intended: like in HDFS-6189 it uses {{System.getProperty("test.build.data", System.getProperty("java.io.tmpdir"))}} as the base directory for MiniDFSCluster, and {{/tmp/...}} as (and only as) HDFS path (in accordance with [~arpitagarwal]'s comment). So it does not use {{C:\tmp}} on the local file system, but still works on Windows too (because {{DFSUtil.isValidName()}} checks the HDFS path, not the local path). [~djp] or [~ste...@apache.org] could you please review? > Some yarn.server.timeline.* tests fail on Windows attempting to use a test > root path containing a colon > --- > > Key: YARN-4928 > URL: https://issues.apache.org/jira/browse/YARN-4928 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 > Environment: OS: Windows Server 2012 > JDK: 1.7.0_79 >Reporter: Gergely Novák >Assignee: Gergely Novák >Priority: Minor > Attachments: YARN-4928.001.patch, YARN-4928.002.patch > > > yarn.server.timeline.TestEntityGroupFSTimelineStore.* and > yarn.server.timeline.TestLogInfo.* fail on Windows, because they are > attempting to use a test root paths like > "/C:/hdp/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/target/test-dir/TestLogInfo", > which contains a ":" (after the Windows drive letter) and > DFSUtil.isValidName() does not accept paths containing ":". > This problem is identical to HDFS-6189, so I suggest to use the same > approach: using "/tmp/..." as test root dir instead of > System.getProperty("test.build.data", System.getProperty("java.io.tmpdir")). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
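The split between the local and HDFS-side paths described in the comment above can be sketched as follows. This is a minimal illustration, not the actual patch, and the colon check is a simplified stand-in for DFSUtil.isValidName():

```java
// Sketch of the v002 approach: the local MiniDFSCluster storage dir may
// contain a Windows drive-letter colon, but only the HDFS-side path has to
// satisfy the name check.
class TestPathSplit {
    // Local storage dir for MiniDFSCluster; on Windows this can resolve to
    // something like C:\tmp and is never passed to the HDFS name check.
    static String localBaseDir() {
        return System.getProperty("test.build.data",
                System.getProperty("java.io.tmpdir"));
    }

    // HDFS-side test root: absolute and colon-free, so it is accepted on
    // every platform.
    static final String HDFS_TEST_ROOT = "/tmp/TestLogInfo";

    // Simplified stand-in for DFSUtil.isValidName(): the real method does
    // more, but the relevant part here is rejecting ":" in the path.
    static boolean isValidHdfsName(String path) {
        return path.startsWith("/") && !path.contains(":");
    }
}
```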
[jira] [Commented] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230296#comment-15230296 ] Bibin A Chundatt commented on YARN-3971: The test failure does not look related to the attached patch; it failed due to a bind exception. {noformat} com.sun.jersey.test.framework.spi.container.TestContainerException: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:463) at sun.nio.ch.Net.bind(Net.java:455) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:413) at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:384) at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:375) at org.glassfish.grizzly.http.server.NetworkListener.start(NetworkListener.java:549) at org.glassfish.grizzly.http.server.HttpServer.start(HttpServer.java:255) at com.sun.jersey.api.container.grizzly2.GrizzlyServerFactory.createHttpServer(GrizzlyServerFactory.java:326) at com.sun.jersey.api.container.grizzly2.GrizzlyServerFactory.createHttpServer(GrizzlyServerFactory.java:343) {noformat} > Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel > recovery > -- > > Key: YARN-3971 > URL: https://issues.apache.org/jira/browse/YARN-3971 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Fix For: 2.8.0 > > Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, > 0003-YARN-3971.patch, 0004-YARN-3971.patch, 0005-YARN-3971.addendum.patch, > 0005-YARN-3971.patch > > > Steps to reproduce > # Create label x,y > # Delete label x,y > # Create label x,y add capacity scheduler xml for labels x and y too > # Restart RM > > Both RM will 
become Standby. > Since below exception is thrown on {{FileSystemNodeLabelsStore#recover}} > {code} > 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: > Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in > state STARTED; cause: java.io.IOException: Cannot remove label=x, because > queue=a1 is using this label. Please remove label on queue before remove the > label > java.io.IOException: Cannot remove label=x, because queue=a1 is using this > label. Please remove label on queue before remove the label > at > org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104) > at > org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118) > at > org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) > at java.security.AccessController.doPrivileged(Native Method) > at 
javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) > at > org.apache
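As an aside on the BindException above (not part of the YARN-3971 patch itself): the usual remedy for "Address already in use" flakiness in tests is to bind to port 0, letting the OS pick a free ephemeral port instead of racing other test runs for a hard-coded one.

```java
// Sketch of ephemeral-port selection for tests.
class FreePortPicker {
    static int pickFreePort() throws Exception {
        // Binding to 0 makes the kernel choose an unused port; closing the
        // socket releases it for the server under test to claim.
        try (java.net.ServerSocket s = new java.net.ServerSocket(0)) {
            return s.getLocalPort();
        }
    }
}
```

There is still a small window between closing the probe socket and the server binding the port, so test frameworks that support binding to 0 directly are preferable when available.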
[jira] [Commented] (YARN-3959) Store application related configurations in Timeline Service v2
[ https://issues.apache.org/jira/browse/YARN-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230253#comment-15230253 ] Varun Saxena commented on YARN-3959: I mean it should be doable for the 1st milestone. > Store application related configurations in Timeline Service v2 > --- > > Key: YARN-3959 > URL: https://issues.apache.org/jira/browse/YARN-3959 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > > We already have configuration field in HBase schema for application entity. > We need to make sure AM write it out when it get launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230248#comment-15230248 ] Jason Lowe commented on YARN-4924: -- Yeah, now that the NM registers with the list of apps it thinks are active and the RM tells it to finish any apps that shouldn't be active we should be covered. We'll need to leave in some recovery code for finished apps so we can clean up any lingering finished app events from the state store, but we can remove the code to store the events. > NM recovery race can lead to container not cleaned up > - > > Key: YARN-4924 > URL: https://issues.apache.org/jira/browse/YARN-4924 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.0, 2.7.2 >Reporter: Nathan Roberts > > It's probably a small window but we observed a case where the NM crashed and > then a container was not properly cleaned up during recovery. > I will add details in first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230172#comment-15230172 ] Hadoop QA commented on YARN-3971: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 38s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 43s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 55s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 40s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s {color} | {color:green} hadoop-yarn-project/hadoop-yarn: patch generated 0 new + 38 unchanged - 1 fixed = 38 total (was 39) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_77. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 13s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_77. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 5s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 22s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:blac
[jira] [Commented] (YARN-4927) TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the default
[ https://issues.apache.org/jira/browse/YARN-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230159#comment-15230159 ] Karthik Kambatla commented on YARN-4927: IIUC, the test is using CS to force a failure of refreshAll. If that is indeed the case, we could just override the MockRM to use an AdminService that fails the refreshAll? By the way, I haven't started on it yet, so please feel free to take it up. > TestRMHA#testTransitionedToActiveRefreshFail fails when FairScheduler is the > default > > > Key: YARN-4927 > URL: https://issues.apache.org/jira/browse/YARN-4927 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > YARN-3893 adds this test, that relies on some CapacityScheduler-specific > stuff for refreshAll to fail, which doesn't apply when using FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
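The override suggested above, failing refreshAll() unconditionally in the test instead of relying on CapacityScheduler-specific behavior, can be sketched like this. Class and method names are stand-ins, not the real MockRM/AdminService signatures:

```java
// Stand-in for AdminService: the real class re-reads queues, ACLs,
// node labels, etc. when the RM transitions to active.
class AdminServiceStub {
    void refreshAll() throws Exception {
        // no-op in the stub
    }
}

// Test double that fails regardless of which scheduler is configured, so
// the test exercises the transition-to-active failure path on its own.
class FailingAdminService extends AdminServiceStub {
    @Override
    void refreshAll() throws Exception {
        throw new IllegalStateException("simulated refreshAll failure");
    }
}
```

A MockRM subclass wired to return the failing service would then make the test scheduler-independent.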
[jira] [Commented] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230003 ] Bibin A Chundatt commented on YARN-3971: [~Naganarasimha] Instead of the state-based approach, I have added a flag that tracks the initStore state and is set once initNodeLabelStore completes. Please help review it.
> Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
> --
>
> Key: YARN-3971
> URL: https://issues.apache.org/jira/browse/YARN-3971
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch, 0005-YARN-3971.addendum.patch, 0005-YARN-3971.patch
>
> Steps to reproduce
> # Create labels x,y
> # Delete labels x,y
> # Create labels x,y and add them to the capacity-scheduler xml as well
> # Restart RM
>
> Both RMs become Standby, since the exception below is thrown from {{FileSystemNodeLabelsStore#recover}}:
> {code}
> 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label
> java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label
>     at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
>     at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
>     at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
>     at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
>     at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
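The flag-based approach described in the comment above can be sketched like this. The class and method names below are hypothetical stand-ins (the real logic lives in CommonNodeLabelsManager/RMNodeLabelsManager); the idea is that the queue-usage check is skipped until store recovery has completed.

```java
// Hypothetical sketch of the flag-based approach: skip the queue-usage
// check while the node label store is still being recovered.
class NodeLabelsManagerSketch {
    private volatile boolean initStoreCompleted = false;

    void initNodeLabelStore() {
        // recover() replays stored add/remove operations; the flag is still
        // false here, so removals during recovery bypass the queue check.
        recoverFromStore();
        initStoreCompleted = true;
    }

    void removeFromClusterNodeLabels(String label) {
        if (initStoreCompleted) {
            // only enforced after recovery has finished
            checkRemoveFromClusterNodeLabelsOfQueue(label);
        }
        // ... actual removal from the cluster label set ...
    }

    private void recoverFromStore() {
        // replaying a stored "remove label x" must not fail during recovery
        removeFromClusterNodeLabels("x");
    }

    void checkRemoveFromClusterNodeLabelsOfQueue(String label) {
        // stand-in for the real check that throws when a queue uses the label
        throw new IllegalStateException("Cannot remove label=" + label
            + ", because a queue is using this label");
    }

    boolean isInitStoreCompleted() {
        return initStoreCompleted;
    }
}
```

This makes the RM restart scenario from the issue description survivable: the replayed remove no longer trips the queue check, while removals requested after startup are still validated.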
[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3971: --- Attachment: 0005-YARN-3971.addendum.patch
> Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
> --
>
> Key: YARN-3971
> URL: https://issues.apache.org/jira/browse/YARN-3971
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch, 0005-YARN-3971.addendum.patch, 0005-YARN-3971.patch
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4562) YARN WebApp ignores the configuration passed to it for keystore settings
[ https://issues.apache.org/jira/browse/YARN-4562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229988#comment-15229988 ] Varun Vasudev commented on YARN-4562: - +1. I'll commit this tomorrow if no one objects. > YARN WebApp ignores the configuration passed to it for keystore settings > > > Key: YARN-4562 > URL: https://issues.apache.org/jira/browse/YARN-4562 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sergey Shelukhin >Assignee: Sergey Shelukhin > Attachments: YARN-4562.patch > > > The conf can be passed to WebApps builder, however the following code in > WebApps.java that builds the HttpServer2 object: > {noformat} > if (httpScheme.equals(WebAppUtils.HTTPS_PREFIX)) { > WebAppUtils.loadSslConfiguration(builder); > } > {noformat} > ...results in loadSslConfiguration creating a new Configuration object; the > one that is passed in is ignored, as far as the keystore/etc. settings are > concerned. loadSslConfiguration has another overload with Configuration > parameter that should be used instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
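A minimal sketch of the bug pattern and fix described in the issue above. `Configuration`, `Builder`, and `WebAppUtils` here are simplified stand-ins for the Hadoop classes (the real `loadSslConfiguration` reads several keystore/truststore properties): the no-arg variant builds a fresh `Configuration`, so the caller's keystore settings are lost unless the overload that accepts a `Configuration` is used.

```java
// Simplified stand-ins (hypothetical) for the Hadoop classes involved:
class Configuration {
    String keystoreLocation; // stands in for the ssl keystore settings
}

class Builder {              // stands in for HttpServer2.Builder
    String keystore;
}

class WebAppUtils {
    // Buggy path: builds a *new* Configuration, so the caller's settings
    // are ignored as far as keystore/etc. settings are concerned.
    static void loadSslConfiguration(Builder b) {
        loadSslConfiguration(b, new Configuration());
    }

    // Fixed call sites should pass the caller's Configuration to this overload.
    static void loadSslConfiguration(Builder b, Configuration conf) {
        b.keystore = conf.keystoreLocation;
    }
}
```

The fix in WebApps.java is then simply to call the two-argument overload with the `Configuration` that was passed to the builder.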
[jira] [Updated] (YARN-4928) Some yarn.server.timeline.* tests fail on Windows attempting to use a test root path containing a colon
[ https://issues.apache.org/jira/browse/YARN-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gergely Novák updated YARN-4928: Attachment: YARN-4928.002.patch > Some yarn.server.timeline.* tests fail on Windows attempting to use a test > root path containing a colon > --- > > Key: YARN-4928 > URL: https://issues.apache.org/jira/browse/YARN-4928 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 > Environment: OS: Windows Server 2012 > JDK: 1.7.0_79 >Reporter: Gergely Novák >Assignee: Gergely Novák >Priority: Minor > Attachments: YARN-4928.001.patch, YARN-4928.002.patch > > > yarn.server.timeline.TestEntityGroupFSTimelineStore.* and > yarn.server.timeline.TestLogInfo.* fail on Windows, because they attempt to use test > root paths like > "/C:/hdp/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/target/test-dir/TestLogInfo", > which contain a ":" (after the Windows drive letter), and > DFSUtil.isValidName() does not accept paths containing ":". > This problem is identical to HDFS-6189, so I suggest using the same > approach: use "/tmp/..." as the test root dir instead of > System.getProperty("test.build.data", System.getProperty("java.io.tmpdir")). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
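The suggested approach can be sketched with a simplified stand-in for DFSUtil.isValidName() (the real method performs more checks): a Windows-style "/C:/..." root is rejected because of the colon, while a "/tmp/..." root passes.

```java
class TestRootPath {
    // Simplified stand-in for DFSUtil.isValidName(): rejects any path
    // component containing ':', as the real check does.
    static boolean isValidName(String src) {
        for (String component : src.split("/")) {
            if (component.contains(":")) {
                return false;
            }
        }
        return true;
    }

    // Portable test root, following the HDFS-6189 approach: no drive
    // letter, hence no colon, on any platform.
    static String testRoot(String testName) {
        return "/tmp/" + testName;
    }
}
```

On Windows, `test.build.data`/`java.io.tmpdir` resolve under the drive letter, producing the rejected "/C:/..." form; the fixed helper sidesteps that entirely.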
[jira] [Commented] (YARN-3461) Consolidate flow name/version/run defaults
[ https://issues.apache.org/jira/browse/YARN-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229910 ] Hadoop QA commented on YARN-3461:
| (x) *-1 overall* |
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 11m 42s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| 0 | mvndep | 0m 47s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 54s | YARN-2928 passed |
| +1 | compile | 6m 25s | YARN-2928 passed with JDK v1.8.0_77 |
| +1 | compile | 7m 18s | YARN-2928 passed with JDK v1.7.0_95 |
| +1 | checkstyle | 1m 7s | YARN-2928 passed |
| +1 | mvnsite | 2m 35s | YARN-2928 passed |
| +1 | mvneclipse | 1m 20s | YARN-2928 passed |
| +1 | findbugs | 4m 8s | YARN-2928 passed |
| +1 | javadoc | 1m 31s | YARN-2928 passed with JDK v1.8.0_77 |
| +1 | javadoc | 1m 52s | YARN-2928 passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 17s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 1s | the patch passed |
| +1 | compile | 6m 18s | the patch passed with JDK v1.8.0_77 |
| +1 | javac | 6m 18s | the patch passed |
| +1 | compile | 7m 19s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 7m 19s | the patch passed |
| +1 | checkstyle | 1m 6s | the patch passed |
| +1 | mvnsite | 2m 32s | the patch passed |
| +1 | mvneclipse | 1m 17s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 5m 4s | the patch passed |
| +1 | javadoc | 1m 30s | the patch passed with JDK v1.8.0_77 |
| +1 | javadoc | 1m 52s | the patch passed with JDK v1.7.0_95 |
| +1 | unit | 2m 6s | hadoop-yarn-common in the patch passed with JDK v1.8.0_77. |
| +1 | unit | 4m 9s | hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_77. |
| -1 | unit | 62m 28s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_77. |
| +1 | unit | 7m 52s | hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.8.0_77. |
| -1 | unit | 101m 13s | hadoop-mapreduce-client-jobclient in the patch failed with JDK v1.8.0_77. |
[jira] [Commented] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
[ https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229904 ] Hadoop QA commented on YARN-4002:
| (x) *-1 overall* |
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 19s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 6m 47s | trunk passed |
| +1 | compile | 0m 28s | trunk passed with JDK v1.8.0_77 |
| +1 | compile | 0m 29s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 0m 19s | trunk passed |
| +1 | mvnsite | 0m 35s | trunk passed |
| +1 | mvneclipse | 0m 16s | trunk passed |
| +1 | findbugs | 1m 5s | trunk passed |
| +1 | javadoc | 0m 23s | trunk passed with JDK v1.8.0_77 |
| +1 | javadoc | 0m 28s | trunk passed with JDK v1.7.0_95 |
| +1 | mvninstall | 0m 31s | the patch passed |
| +1 | compile | 0m 24s | the patch passed with JDK v1.8.0_77 |
| +1 | javac | 0m 24s | the patch passed |
| +1 | compile | 0m 27s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 0m 27s | the patch passed |
| +1 | checkstyle | 0m 16s | the patch passed |
| +1 | mvnsite | 0m 34s | the patch passed |
| +1 | mvneclipse | 0m 13s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 1m 15s | the patch passed |
| +1 | javadoc | 0m 19s | the patch passed with JDK v1.8.0_77 |
| +1 | javadoc | 0m 24s | the patch passed with JDK v1.7.0_95 |
| -1 | unit | 77m 19s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_77. |
| -1 | unit | 55m 45s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. |
| +1 | asflicense | 0m 18s | Patch does not generate ASF License warnings. |
| | | 149m 57s | |
|| Reason || Tests ||
| JDK v1.8.0_77 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.webapp.TestRMWithCSRFFilter |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation |
| JDK v1.8.0_77 Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRM
[jira] [Commented] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229834 ] Hadoop QA commented on YARN-4849:
| (x) *-1 overall* |
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 15s | Docker mode activated. |
| 0 | shelldocs | 0m 5s | Shelldocs was not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 1s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| 0 | mvndep | 2m 39s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 27s | YARN-3368 passed |
| +1 | compile | 5m 41s | YARN-3368 passed with JDK v1.8.0_77 |
| +1 | compile | 6m 30s | YARN-3368 passed with JDK v1.7.0_95 |
| +1 | mvnsite | 9m 22s | YARN-3368 passed |
| +1 | mvneclipse | 1m 46s | YARN-3368 passed |
| +1 | javadoc | 5m 29s | YARN-3368 passed with JDK v1.8.0_77 |
| +1 | javadoc | 9m 16s | YARN-3368 passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 14s | Maven dependency ordering for patch |
| +1 | mvninstall | 8m 53s | the patch passed |
| +1 | compile | 5m 38s | the patch passed with JDK v1.8.0_77 |
| +1 | javac | 5m 38s | the patch passed |
| +1 | compile | 6m 39s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 6m 39s | the patch passed |
| +1 | mvnsite | 8m 46s | the patch passed |
| +1 | mvneclipse | 0m 38s | the patch passed |
| +1 | shellcheck | 0m 8s | There were no new shellcheck issues. |
| -1 | whitespace | 0m 0s | The patch has 53 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| +1 | javadoc | 5m 13s | the patch passed with JDK v1.8.0_77 |
| +1 | javadoc | 9m 13s | the patch passed with JDK v1.7.0_95 |
| -1 | unit | 10m 33s | root in the patch failed with JDK v1.8.0_77. |
| -1 | unit | 11m 13s | root in the patch failed with JDK v1.7.0_95. |
| -1 | asflicense | 0m 23s | Patch generated 2 ASF License warnings. |
| | | 118m 12s | |
|| Reason || Tests ||
| JDK v1.8.0_77 Timed out junit tests | org.apache.hadoop.util.TestNativeLibraryChecker |
| JDK v1.7.0_95 Timed out junit tests | org.apache.hadoop.util.TestNativeLibraryChecker |
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:fbe3e86 |
| JIRA Patch URL | https:/