[jira] [Commented] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
[ https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498937#comment-14498937 ] Xuan Gong commented on YARN-2605: - [~adhoot] Feel free to assign back to yourself, if you have already worked on this ticket. [~steve_l] Please take a look. Removed the refresh header, added the Location header to hold the redirect URL, and set the status to 307. {code} $ curl -i http://127.0.0.1:33188/ws/v1/cluster/metrics HTTP/1.1 307 TEMPORARY_REDIRECT Cache-Control: no-cache Expires: Thu, 16 Apr 2015 23:01:47 GMT Date: Thu, 16 Apr 2015 23:01:47 GMT Pragma: no-cache Expires: Thu, 16 Apr 2015 23:01:47 GMT Date: Thu, 16 Apr 2015 23:01:47 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Location: http://localhost:23188/ws/v1/cluster/metrics Content-Length: 84 Server: Jetty(6.1.26) This is standby RM. The redirect url ishttp://localhost:23188/ws/v1/cluster/metrics {code} If I do {code} $ curl -i -L http://127.0.0.1:33188/ws/v1/cluster/metrics {code} it will redirect to the active RM and get the metrics. [RM HA] Rest api endpoints doing redirect incorrectly - Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong Assignee: Anubhav Dhoot Labels: newbie Attachments: YARN-2605.1.patch The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or return a well-defined error message (json or xml) stating the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
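For orientation, a minimal sketch of a standby-side filter that answers REST calls with an explicit 307 plus a Location header, rather than a meta-refresh, might look like the following. The class name and the way the active RM address is obtained are illustrative assumptions; this is not the attached YARN-2605.1.patch.
{code}
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: redirect REST calls hitting a standby RM to the active RM.
public class StandbyApiRedirectFilter implements Filter {

  // Assumed to be resolved from configuration, e.g. "http://localhost:23188".
  private final String activeRMWebAddress;

  public StandbyApiRedirectFilter(String activeRMWebAddress) {
    this.activeRMWebAddress = activeRMWebAddress;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    String redirect = activeRMWebAddress + request.getRequestURI();
    // 307 preserves the HTTP method; curl -L and most programmatic clients follow it.
    response.setStatus(HttpServletResponse.SC_TEMPORARY_REDIRECT);
    response.setHeader("Location", redirect);
    // Keep a human-readable body; REST clients will ignore it.
    response.getWriter().println("This is standby RM. The redirect url is " + redirect);
    // Do not continue the filter chain: the standby answers the request itself.
  }

  @Override
  public void init(FilterConfig filterConfig) throws ServletException {
  }

  @Override
  public void destroy() {
  }
}
{code}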
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499204#comment-14499204 ] Craig Welch commented on YARN-3463: --- bq. how about change it to Map<String, String> to explicitly pass option_key=value pairs to configure OrderingPolicy Signature changed; will add configuration to pass in sizeBasedWeight as part of the FairOrderingPolicy patch, as that's where it would belong... bq. you can suppress them to avoid javac warning Suppressed. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3499) Optimize ResourceManager Web loading speed
Peter Shi created YARN-3499: --- Summary: Optimize ResourceManager Web loading speed Key: YARN-3499 URL: https://issues.apache.org/jira/browse/YARN-3499 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi Priority: Minor After running 10k jobs, the ResourceManager web UI becomes slow to load. As the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue makes the server side do the paging logic, so that loading will be fast. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3501) problem in running yarn scheduler load simulator
[ https://issues.apache.org/jira/browse/YARN-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3501: --- Fix Version/s: (was: 2.3.0) problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3501) problem in running yarn scheduler load simulator
[ https://issues.apache.org/jira/browse/YARN-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3501: --- Target Version/s: (was: 2.6.0) problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499247#comment-14499247 ] Zhijie Shen commented on YARN-3437: --- Per my comment on [YARN-3390 | https://issues.apache.org/jira/browse/YARN-3390?focusedCommentId=14499245page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14499245]. Please feel free to move forward. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3501) problem in running yarn scheduler load simulator
[ https://issues.apache.org/jira/browse/YARN-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499375#comment-14499375 ] Naganarasimha G R commented on YARN-3501: - Can some more information be given on this JIRA? problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499285#comment-14499285 ] Hadoop QA commented on YARN-3463: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726059/YARN-3463.68.patch against trunk revision bb6dde6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7372//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7372//console This message is automatically generated. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499326#comment-14499326 ] Li Lu commented on YARN-3134: - Hi [~djp] and [~zjshen], thanks a lot for the review! I'll fix them pretty soon and upload a new patch. For now, I'm focusing on correctness, readability, and exception handling. Does that plan sound good to you? Thanks! [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
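As background for the quoted Phoenix description, here is a minimal sketch of what reading and writing through the client-embedded JDBC driver looks like; the table and column names are made up for illustration and are not the schema in YARN-3134DataSchema.pdf.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PhoenixSketch {
  public static void main(String[] args) throws SQLException {
    // The Phoenix JDBC URL points at the HBase ZooKeeper quorum.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
      conn.createStatement().execute(
          "CREATE TABLE IF NOT EXISTS ENTITY (ID VARCHAR PRIMARY KEY, CREATED BIGINT)");
      try (PreparedStatement ps =
          conn.prepareStatement("UPSERT INTO ENTITY (ID, CREATED) VALUES (?, ?)")) {
        ps.setString(1, "application_0001");
        ps.setLong(2, System.currentTimeMillis());
        ps.executeUpdate();
      }
      conn.commit(); // Phoenix batches mutations until the connection is committed.
      try (ResultSet rs =
          conn.createStatement().executeQuery("SELECT ID, CREATED FROM ENTITY")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " created at " + rs.getLong(2));
        }
      }
    }
  }
}
{code}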
[jira] [Resolved] (YARN-3499) Optimize ResourceManager Web loading speed
[ https://issues.apache.org/jira/browse/YARN-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Shi resolved YARN-3499. - Resolution: Duplicate Duplicate of YARN-3500. Optimize ResourceManager Web loading speed -- Key: YARN-3499 URL: https://issues.apache.org/jira/browse/YARN-3499 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi Priority: Minor After running 10k jobs, the ResourceManager web UI becomes slow to load. As the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue makes the server side do the paging logic, so that loading will be fast. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499314#comment-14499314 ] Zhijie Shen commented on YARN-3431: --- Right, TimelineEntity is the generic Java form for us to compose a timeline entity in java code, while its corresponding JSON object is the payload during REST communication. Subclasses of TimelineEntity are defined to facilitate us/users to easily manipulate some predefined, specific attributes. bq. My main problem is with the prototype field of TimelineEntity. Maybe I should change prototype to real. After receiving the entity from the endpoint of the web server, not matter it was the generic TimelineEntity or the subclass object, it will be deserialized as TimelineEntity object. If it was the subclass object, the content is preserved, but the Java class hierarchy is lost after deserialization. However, we can use TimelineEntity and its type to construct the right subclass object in a *proxy* way. bq. For HierarchicalTimelineEntity, seems like we're not adding any special tags when we addIsRelatedToEntity() in setParent() Yeah, relates to/ is related to is used to construct a directed graph among entities. Parent-child relationship is a tree, which can be described by relates to/ is related. bq. Are we prohibiting the users from using isRelatedToEntities in HierarchicalTimelineEntity completely to avoid problems? Sounds good. I used to think about it, but not include it in this patch. bq. , I'm not sure if we really need the subclass information. I'm not pretty sure, but I guess we may probably not need the subclasses' Java APIs, and that's why I put a comment there. However, since it's not a big overhead given the way we construct the subclass object, I prefer to leave the code there, in case we want subclass APIs somewhere (e.g., aggregation). There're two additional bugs in this patch. I'll fix the outstanding issues and upload a new one later. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
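To make the proxy-style construction described above concrete, here is a hedged, heavily simplified sketch; the class and method names are illustrative and not the timeline service v.2 API. The subclass view delegates to the generic entity that was deserialized from JSON instead of deep-copying it.
{code}
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the generic entity that comes off the wire.
class GenericTimelineEntity {
  private String type;
  private final Map<String, Object> info = new HashMap<>();

  public String getType() { return type; }
  public void setType(String type) { this.type = type; }
  public Map<String, Object> getInfo() { return info; }
  public void addInfo(String key, Object value) { info.put(key, value); }
}

// Illustrative subclass "view": it wraps the deserialized entity (the
// "prototype"/"real" object in the discussion) rather than copying its content.
class ApplicationEntityView extends GenericTimelineEntity {
  private final GenericTimelineEntity real;

  ApplicationEntityView(GenericTimelineEntity real) { this.real = real; }

  @Override public String getType() { return real.getType(); }
  @Override public Map<String, Object> getInfo() { return real.getInfo(); }

  // Subclass-specific convenience accessor, assuming a hypothetical "QUEUE" info key.
  public String getQueue() { return (String) real.getInfo().get("QUEUE"); }
}
{code}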
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499288#comment-14499288 ] Hadoop QA commented on YARN-3493: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726049/YARN-3493.3.patch against trunk revision bb6dde6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestWorkPreservingRMRestartForNodeLabel Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7373//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7373//console This message is automatically generated. RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, yarn-yarn-resourcemanager.log.zip RM fails to come up for the following case: 1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in background and wait for the job to reach running state 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 before the above job completes 4. Restart RM 5. 
RM fails to come up with the below error {code:title= RM error for Mem settings changed} - RM app submission failed in validating AM resource request for application application_1429094976272_0008 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory 0, or requested memory max configured, requestedMemory=3072, maxMemory=2048 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031) at
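For reference, a minimal self-contained sketch of the check that trips during recovery, using the numbers from the reproduction steps above; the method is an illustration of the validation, not SchedulerUtils itself.
{code}
public class MaxAllocationCheck {

  // Mirrors the kind of validation applied to each recovered resource request.
  static void validate(int requestedMemory, int maxMemory) {
    if (requestedMemory < 0 || requestedMemory > maxMemory) {
      throw new IllegalArgumentException(
          "Invalid resource request, requested memory < 0, or requested memory > "
              + "max configured, requestedMemory=" + requestedMemory
              + ", maxMemory=" + maxMemory);
    }
  }

  public static void main(String[] args) {
    validate(3072, 4000); // passes while yarn.scheduler.maximum-allocation-mb is 4000
    validate(3072, 2048); // throws after the limit is restored to 2048, aborting recovery
  }
}
{code}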
[jira] [Created] (YARN-3501) problem in running yarn scheduler load simulator
Awadhesh kumar shukla created YARN-3501: --- Summary: problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla Fix For: 2.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499393#comment-14499393 ] Sunil G commented on YARN-3487: --- Hi [~leftnoteasy], I am sorry for providing less context earlier. After seeing your comment again, I could see that my comment was also along the same lines. Runtime updates can add or change some ACLs for a Queue. So if the synchronized keyword is removed, checkAccess is open and some checks may pass/fail based on the partial information available for the Queue's ACLs. So we may run into partial errors, which is a race condition. CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
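To illustrate the concern, here is a hedged sketch of the usual way checkAccess can be made safe without holding the scheduler lock; this is not the YARN-3487 patch, just the pattern: the ACLs are published as an immutable snapshot behind a volatile reference, so a concurrent queue refresh never exposes partially updated state.
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class QueueAclSnapshot {
  // Volatile reference to an immutable snapshot; replaced wholesale on refresh.
  private volatile Map<String, String> acls = Collections.emptyMap();

  // Called from the (still synchronized) queue refresh path.
  void refresh(Map<String, String> newAcls) {
    acls = Collections.unmodifiableMap(new HashMap<>(newAcls));
  }

  // Lock-free read path, analogous to an unsynchronized checkAccess.
  boolean checkAccess(String user, String operation) {
    String allowedUsers = acls.get(operation);
    return allowedUsers != null && allowedUsers.contains(user);
  }
}
{code}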
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499264#comment-14499264 ] Zhijie Shen commented on YARN-3134: --- Some thoughts about backend POC, not just limited to Phoenix writer, but HBase writer too. 1. At the current stage, I suggest we focus on logic correctness and performance tuning. We may have multiple iterations between improving and doing benchmark. 2. At the beginning, we may not implement storing everything of timeline entity (such as relationship), but we should at lease make sure what Phoenix writer and HBase writer have implemented are identical in terms of the data to store. 3. It's good if we can have rich test suites like TimelineStoreTestUtils to ensure the robustness of the writer. Moreover, it's black box testing, and we can use them to check if Phoenix writer and HBase writer behave the same. /cc [~vrushalic] For Phoenix implementation only: I used Phoenix writer for a real deployment, and I could see the implementation is not thread safe. ConcurrentModificatioException will be thrown upon committing the statements. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
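On the thread-safety observation, one hedged possibility (an assumption about how it could be addressed, not the actual follow-up patch) is to stop sharing a single Phoenix Connection across writer threads, since the ConcurrentModificationException points at concurrent mutation of per-connection statement state.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative helper: each writer thread gets its own Phoenix connection and
// commits on it independently, so no statement list is shared across threads.
class PerThreadPhoenixConnection {
  private final String url;
  private final ThreadLocal<Connection> conn = new ThreadLocal<>();

  PerThreadPhoenixConnection(String url) {
    this.url = url;
  }

  Connection get() throws SQLException {
    Connection c = conn.get();
    if (c == null || c.isClosed()) {
      c = DriverManager.getConnection(url);
      conn.set(c);
    }
    return c;
  }
}
{code}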
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499399#comment-14499399 ] Sunil G commented on YARN-1963: --- Yes. We could try to support both Integer and Label (with mappings). We may open independent JIRAs to handle this case (we have both patches and will sync them up as one), which should achieve the same goal without complexity. And we will look for a simpler version for now, not complex re-mappings etc. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: 0001-YARN-1963-prototype.patch, YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3463: -- Attachment: YARN-3463.68.patch Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499205#comment-14499205 ] Craig Welch commented on YARN-3463: --- btw, the tests pass on my box with the change, failures not related to the patch Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499245#comment-14499245 ] Zhijie Shen commented on YARN-3390: --- The only conflict part between YARN-3437 and this Jira is TimelineCollectorManager base class. And we happened to resort to the similar refactoring method. I'm okay to commit YARN-3437 first. However, the comments about the base TimelineCollectorManager also apply. At least, I think we should use ApplicationId instead of String. Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3500) Optimize ResourceManager Web loading speed
[ https://issues.apache.org/jira/browse/YARN-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499297#comment-14499297 ] Peter Shi commented on YARN-3500: - Yes, I have marked YARN-3499 as a duplicate. Optimize ResourceManager Web loading speed -- Key: YARN-3500 URL: https://issues.apache.org/jira/browse/YARN-3500 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi Priority: Minor After running 10k jobs, the ResourceManager web UI becomes slow to load. As the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue makes the server side do the paging logic, so that loading will be fast. Loading 10k jobs costs 55 sec; loading 2k costs 7 sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
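A tiny hedged sketch of the server-side paging idea (the helper and its parameters are illustrative, not the eventual patch): the web layer renders only the requested window of application rows instead of serializing all 10k reports into one response.
{code}
import java.util.List;

class AppPaging {
  // Returns the slice of rows for the given zero-based page.
  static <T> List<T> page(List<T> allApps, int pageIndex, int pageSize) {
    int from = Math.min(pageIndex * pageSize, allApps.size());
    int to = Math.min(from + pageSize, allApps.size());
    return allApps.subList(from, to);
  }
}
{code}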
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499398#comment-14499398 ] Li Lu commented on YARN-3134: - Hi [~zjshen] could you please provide some more information to reproduce the failures? Or, the exception stack would also be helpful. I'm trying to setup a deployment but would like to make sure we're seeing consistent problems. Thanks! [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499413#comment-14499413 ] Li Lu commented on YARN-3431: - bq. After receiving the entity from the endpoint of the web server, not matter it was the generic TimelineEntity or the subclass object, it will be deserialized as TimelineEntity object. If it was the subclass object, the content is preserved, but the Java class hierarchy is lost after deserialization. However, we can use TimelineEntity and its type to construct the right subclass object in a proxy way. OK, I agree the current design would save us one deep copy every time we receive a timeline entity. I'm still thinking about an appropriate name for the prototype field to better represent its nature... bq. For HierarchicalTimelineEntity, seems like we're not adding any special tags when we addIsRelatedToEntity() in setParent() bq. Yeah, relates to/ is related to is used to construct a directed graph among entities. Parent-child relationship is a tree, which can be described by relates to/ is related. bq. Are we prohibiting the users from using isRelatedToEntities in HierarchicalTimelineEntity completely to avoid problems? bq. Sounds good. I used to think about it, but not include it in this patch. That sounds good. It would be very helpful to explicitly prohibit direct usages of isRelatedToEntities and relatesToEntities IMHO. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
[ https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499435#comment-14499435 ] Steve Loughran commented on YARN-2605: -- patch looks good in production; 307 is the error code rest apps need; these will ignore the text so that can stay human-readable. Why is a test now tagged as @ignore? [RM HA] Rest api endpoints doing redirect incorrectly - Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong Assignee: Xuan Gong Labels: newbie Attachments: YARN-2605.1.patch The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or return a well-defined error message (json or xml) stating that the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3503) Expose disk utilization percentage on NM via JMX
Varun Vasudev created YARN-3503: --- Summary: Expose disk utilization percentage on NM via JMX Key: YARN-3503 URL: https://issues.apache.org/jira/browse/YARN-3503 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3503) Expose disk utilization percentage on NM via JMX
[ https://issues.apache.org/jira/browse/YARN-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3503: Description: It would be useful to expose the disk utilization on the NMs via JMX so that alerts can be setup for nodes. Expose disk utilization percentage on NM via JMX Key: YARN-3503 URL: https://issues.apache.org/jira/browse/YARN-3503 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev It would be useful to expose the disk utilization on the NMs via JMX so that alerts can be setup for nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
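A hedged sketch of how such a value is typically surfaced through the Hadoop metrics2 system, which exposes registered sources over JMX; the class, metric name, and wiring below are illustrative assumptions rather than the eventual YARN-3503 change.
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

@Metrics(about = "NodeManager disk metrics", context = "yarn")
public class NodeDiskMetrics {

  @Metric("Disk utilization percentage across good local dirs")
  MutableGaugeInt goodLocalDirsDiskUtilizationPerc;

  public static NodeDiskMetrics create() {
    // Registering the source makes the gauge visible via JMX and other sinks.
    return DefaultMetricsSystem.instance().register(new NodeDiskMetrics());
  }

  public void setDiskUtilization(int percentage) {
    goodLocalDirsDiskUtilizationPerc.set(percentage);
  }
}
{code}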
[jira] [Commented] (YARN-3496) Add a configuration to disable/enable storing localization state in NMLeveldbStateStore
[ https://issues.apache.org/jira/browse/YARN-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499537#comment-14499537 ] zhihai xu commented on YARN-3496: - Hi [~jlowe], you are right. Based on my profiling at YARN-3491, the levelDB overhead is minor. Thanks. Add a configuration to disable/enable storing localization state in NMLeveldbStateStore --- Key: YARN-3496 URL: https://issues.apache.org/jira/browse/YARN-3496 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Add a configuration to disable/enable storing localization state in NMLeveldbStateStore. Storing localization state in the levelDB may have some overhead, which may affect NM performance. It would be better to have a configuration to disable/enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. was: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). Because pending will be only accessed in PublicLocalizer thread. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499401#comment-14499401 ] Sunil G commented on YARN-2003: --- Hi [~leftnoteasy] bq. authenticateApplicationPriority, I'm wondering if we really need it. I understand your point. But we may fire a new APP_ADDED event from RMAppManager to the respective scheduler with an unapproved priority, and then reject it from there. If we can do this much earlier with the help of a single API, it may avoid some extra event handling in the case of a wrong (invalid priority) app submission. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it. Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
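A hedged sketch of the early, single-API validation being argued for; the names are illustrative, not the actual patch. The requested priority is checked against the queue's limit at submission time, before any APP_ADDED event is fired.
{code}
class PriorityCheck {
  // Returns the priority to use, or throws before any scheduler event is fired.
  static int checkAndGet(int requested, int queueMax, int queueDefault) {
    if (requested < 0) {
      return queueDefault; // unspecified priority: fall back to the queue default
    }
    if (requested > queueMax) {
      throw new IllegalArgumentException(
          "Requested priority " + requested + " exceeds queue maximum " + queueMax);
    }
    return requested;
  }
}
{code}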
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499534#comment-14499534 ] zhihai xu commented on YARN-3491: - Hi [~jlowe], You are right, I am really sorry all my previous guesses are wrong. I did the profiling and I find out the bottleneck is at the following code {code} getInitializedLocalDirs(); getInitializedLogDirs(); {code} More accurately the bottleneck is at checkLocalDir which call getFileStatus. I did two round profiling: 1.I measure the time in PublicLocalizer#addResource: the following code include levelDB operation take 1 ms. {code} Path publicRootPath = dirsHandler.getLocalPathForWrite(. + Path.SEPARATOR + ContainerLocalizer.FILECACHE, ContainerLocalizer.getEstimatedSize(resource), true); Path publicDirDestPath = publicRsrc.getPathForLocalization(key, publicRootPath); if (!publicDirDestPath.getParent().equals(publicRootPath)) { DiskChecker.checkDir(new File(publicDirDestPath.toUri().getPath())); } {code} getInitializedLocalDirs and getInitializedLogDirs take 12 ms together And the following queue.submit code take less than 1 ms. {code} synchronized (pending) { pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath, resource, request.getContext().getStatCache())), request); } {code} 2. then I measure the time in getInitializedLocalDirs and getInitializedLogDirs. I find out checkLocalDir is really slow which is called by getInitializedLocalDirs. checkLocalDir takes 14 ms. There is only one local Dir in my test environment. {code} synchronized private ListString getInitializedLocalDirs() { ListString dirs = dirsHandler.getLocalDirs(); ListString checkFailedDirs = new ArrayListString(); for (String dir : dirs) { try { checkLocalDir(dir); } catch (YarnRuntimeException e) { checkFailedDirs.add(dir); } } {code} The log in my previous comment has more than 10 local Dirs, which will call checkLocalDir more than 10 times 10 * 14 is about 100+ms, So I find out where the 100+ms delay come from. I attached a patch YARN-3491.000.patch to fix the issue, The patch will call getInitializedLocalDirs only once for each container. The original code will call getInitializedLocalDirs for each public resource. Each container can have hundreds of public resource, which is the situation in my previous log. [~jlowe], Could you review it? thanks PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). 
Because pending will be only accessed in PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
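A hedged sketch of the idea behind the attached fix as described above (illustrative, not the literal YARN-3491.000.patch): remember which local dirs have already passed the slow checkLocalDir-style verification so the roughly 10 ms check runs once per dir rather than once per public resource.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CheckedLocalDirs {
  // Wraps the expensive per-dir verification (getFileStatus etc.).
  interface DirChecker {
    void check(String dir);
  }

  private final Set<String> verifiedDirs = ConcurrentHashMap.newKeySet();

  void ensureInitialized(Iterable<String> localDirs, DirChecker checker) {
    for (String dir : localDirs) {
      // add() returns false for dirs already verified, so the slow path
      // is taken at most once per dir instead of once per resource.
      if (verifiedDirs.add(dir)) {
        checker.check(dir);
      }
    }
  }
}
{code}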
[jira] [Commented] (YARN-3261) rewrite resourcemanager restart doc to remove roadmap bits
[ https://issues.apache.org/jira/browse/YARN-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499517#comment-14499517 ] Gururaj Shetty commented on YARN-3261: -- Hi [~aw] Kindly review the updated patch and let me know if I need to change anything. Thanks Regards, Gururaj rewrite resourcemanager restart doc to remove roadmap bits --- Key: YARN-3261 URL: https://issues.apache.org/jira/browse/YARN-3261 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Gururaj Shetty Attachments: YARN-3261.01.patch Another mixture of roadmap and instruction manual that seems to be ever present in a lot of the recently written documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. was: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. was: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3502) Expose number of unhealthy disks on NM via JMX
[ https://issues.apache.org/jira/browse/YARN-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3502: Description: It would be useful to expose the number of unhealthy disks on the NMs via JM so that alerts can be setup for the nodes. Expose number of unhealthy disks on NM via JMX -- Key: YARN-3502 URL: https://issues.apache.org/jira/browse/YARN-3502 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev It would be useful to expose the number of unhealthy disks on the NMs via JM so that alerts can be setup for the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.000.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). Because pending will be only accessed in PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499402#comment-14499402 ] Sunil G commented on YARN-2003: --- Thank you [~leftnoteasy] for sharing the comments. I will rebase the patch and address the comments mentioned above. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it. Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3502) Expose number of unhealthy disks on NM via JMX
Varun Vasudev created YARN-3502: --- Summary: Expose number of unhealthy disks on NM via JMX Key: YARN-3502 URL: https://issues.apache.org/jira/browse/YARN-3502 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3496) Add a configuration to disable/enable storing localization state in NMLeveldbStateStore
[ https://issues.apache.org/jira/browse/YARN-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3496. - Resolution: Not A Problem Add a configuration to disable/enable storing localization state in NMLeveldbStateStore --- Key: YARN-3496 URL: https://issues.apache.org/jira/browse/YARN-3496 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Add a configuration to disable/enable storing localization state in NMLeveldbStateStore. Storing localization state in the levelDB may have some overhead, which may affect NM performance. It would be better to have a configuration to disable/enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499764#comment-14499764 ] Junping Du commented on YARN-3134: -- bq. 1. At the current stage, I suggest we focus on logic correctness and performance tuning. We may have multiple iterations between improving and doing benchmark +1. We should get some performance data which help us better understanding on the direction and priority. bq. For now, I'm focusing on correctness, readability, and exception handling. Does that plan sound good to you? Sounds like a good plan. Thanks [~gtCarrera9]. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3381: --- Attachment: YARN-3381-002.patch A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381.patch Appears that InvalidStateTransitonException should be InvalidStateTransitionException. Transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499750#comment-14499750 ] Hadoop QA commented on YARN-2268: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726128/0001-YARN-2268.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/7375//artifact/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRM Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7375//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7375//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7375//console This message is automatically generated. Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499609#comment-14499609 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #166 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/166/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
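The behavioural change described above — probe the renewal at submission time, but on failure skip only the automatic-renewal scheduling instead of failing the app — boils down to the pattern sketched below. This is a simplified illustration, not the actual DelegationTokenRenewer code; the TokenOps interface stands in for the real internals.

{code}
import java.io.IOException;

// Simplified sketch of the tolerant-renewal idea described in this JIRA.
public class TolerantRenewalSketch {

  // Hypothetical stand-ins for the real token-handling internals.
  interface TokenOps {
    void renew() throws IOException;
    void scheduleAutomaticRenewal();
  }

  public void handleTokenOnSubmission(TokenOps token, String appId) {
    try {
      token.renew();                      // may fail when the remote realm does not trust the RM
      token.scheduleAutomaticRenewal();   // only schedule renewal when the probe succeeds
    } catch (IOException e) {
      // Do not bubble the error back to the client: skip scheduling, keep the submission alive.
      System.err.println("Skipping automatic renewal for " + appId + ": " + e.getMessage());
    }
  }
}
{code}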
[jira] [Commented] (YARN-3181) FairScheduler: Fix up outdated findbugs issues
[ https://issues.apache.org/jira/browse/YARN-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499667#comment-14499667 ] Tsuyoshi Ozawa commented on YARN-3181: -- I agree with Karthik's comment - this is not an urgent concern. In fact, findbugs-exclude.xml is not outdated, since FairScheduler does some optimizations for better concurrency that findbugs warns about. On this JIRA, we should check the lock ordering very carefully without degrading performance. As a result, it would be good if we could remove the IS2_INCONSISTENT_SYNC exclusion from the findbugs-exclude file. FairScheduler: Fix up outdated findbugs issues -- Key: YARN-3181 URL: https://issues.apache.org/jira/browse/YARN-3181 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Brahma Reddy Battula Attachments: YARN-3181-002.patch, yarn-3181-1.patch In FairScheduler, we have excluded some findbugs-reported errors. Some of them aren't applicable anymore, and there are a few that can be easily fixed without needing an exclusion. It would be nice to fix them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3495) Confusing log generated by FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499702#comment-14499702 ] Tsuyoshi Ozawa commented on YARN-3495: -- Oops, I meant discussion on YARN-3197. Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3495.patch 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499803#comment-14499803 ] Thomas Graves commented on YARN-3434: - Ok, I'll make the changes and post an updated patch Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0 User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, each 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499836#comment-14499836 ] Hadoop QA commented on YARN-3381: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726152/YARN-3381-002.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7376//console This message is automatically generated. A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381.patch Appears that InvalidStateTransitonException should be InvalidStateTransitionException. Transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499797#comment-14499797 ] Brahma Reddy Battula commented on YARN-3381: Rebased the patch.. Kindly review.. A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381.patch Appears that InvalidStateTransitonException should be InvalidStateTransitionException. Transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499580#comment-14499580 ] Hadoop QA commented on YARN-3491: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726118/YARN-3491.000.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7374//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7374//console This message is automatically generated. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499641#comment-14499641 ] Rohith commented on YARN-2268: -- I verified the patch by deploying it in an HA cluster and a non-HA cluster. If any active RM is found in the cluster, an exception will be thrown back to the console. Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
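One simple way to detect a running RM before allowing a format, sketched below, is to probe the RM's REST endpoint. This is only an illustration of the guard's intent; the attached patch may implement the check differently (for example, via the state store itself).

{code}
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative pre-format guard; not the mechanism used by the attached patch.
public class FormatGuardSketch {

  static boolean rmLooksActive(String rmWebAddress) {
    try {
      // /ws/v1/cluster/info is the standard RM REST endpoint; the host below is an assumption.
      HttpURLConnection conn =
          (HttpURLConnection) new URL(rmWebAddress + "/ws/v1/cluster/info").openConnection();
      conn.setConnectTimeout(2000);
      return conn.getResponseCode() == 200;
    } catch (Exception e) {
      return false; // no response: assume no RM is serving on that address
    }
  }

  public static void main(String[] args) {
    if (rmLooksActive("http://localhost:8088")) {
      throw new IllegalStateException(
          "An RM appears to be running; refusing to format the state store");
    }
    System.out.println("No active RM detected; formatting could proceed");
  }
}
{code}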
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499684#comment-14499684 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/157/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-yarn-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499621#comment-14499621 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Yarn-trunk #900 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/900/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-yarn-project/CHANGES.txt YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499675#comment-14499675 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2098 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2098/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3495) Confusing log generated by FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499701#comment-14499701 ] Tsuyoshi Ozawa commented on YARN-3495: -- +1. I checked the discussion on YARN-3495, and the contents of this log look good to me. I also checked that containerStatus cannot be null in any case. I'll commit this two days from now. Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3495.patch 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2268: - Issue Type: Improvement (was: Bug) Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2268: - Attachment: 0001-YARN-2268.patch Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499638#comment-14499638 ] Rohith commented on YARN-2268: -- Attached the patch for disallowing format store using previous approach. Kindly review the patch Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3503) Expose disk utilization percentage on NM via JMX
[ https://issues.apache.org/jira/browse/YARN-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499885#comment-14499885 ] Vinod Kumar Vavilapalli commented on YARN-3503: --- Given we are starting afresh on exposing resource-usage, how about we make this a REST API and merge it into YARN-3332? Expose disk utilization percentage on NM via JMX Key: YARN-3503 URL: https://issues.apache.org/jira/browse/YARN-3503 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev It would be useful to expose the disk utilization on the NMs via JMX so that alerts can be setup for nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499939#comment-14499939 ] Arun Suresh commented on YARN-2962: --- [~varun_saxena], wondering if you need any help with this. Would like to get this in soon. ZKRMStateStore: Limit the number of znodes under a znode Key: YARN-2962 URL: https://issues.apache.org/jira/browse/YARN-2962 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-2962.01.patch We ran into this issue where we were hitting the default ZK server message size configs, primarily because the message had too many znodes even though individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1442#comment-1442 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #167 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/167/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1402) Related Web UI, CLI changes on exposing client API to check log aggregation status
[ https://issues.apache.org/jira/browse/YARN-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500019#comment-14500019 ] Junping Du commented on YARN-1402: -- Latest patch looks good to me. [~xgong], can you file a separate JIRA to track the test failure, in case we don't have one? Related Web UI, CLI changes on exposing client API to check log aggregation status -- Key: YARN-1402 URL: https://issues.apache.org/jira/browse/YARN-1402 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1402.1.patch, YARN-1402.2.patch, YARN-1402.3.1.patch, YARN-1402.3.2.patch, YARN-1402.3.patch, YARN-1402.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500287#comment-14500287 ] Inigo Goiri commented on YARN-3482: --- Yes, that one is good. My proposal for the third one was meaningless... I'll go code this. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3410: - Attachment: 0004-YARN-3410.patch Updated the patch fixing usage format.. kindly review updated patch YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, 0004-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500257#comment-14500257 ] Hadoop QA commented on YARN-3136: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726183/00011-YARN-3136.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7377//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7377//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7377//console This message is automatically generated. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500277#comment-14500277 ] Karthik Kambatla commented on YARN-3482: For the third config, would something like yarn.nodemanager.dynamic-resource-availability=true/false be more descriptive? Admin interface (with a special command) sounds reasonable. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: 00012-YARN-3136.patch Rebased against trunk. Also changed the findbugs suppression for getTransferredContainers method. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
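The general idea behind removing the scheduler lock from this path — keep the per-application live-container view in a concurrent structure that AM registration can read without the scheduler-wide lock — is sketched below. This is an illustration of the approach only, not the contents of the attached patches.

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch: serve getTransferredContainers-style reads from a concurrent map, lock-free.
public class TransferredContainersSketch<C> {

  // One entry per application attempt; written by the scheduler, read by AM registration.
  private final Map<String, List<C>> liveContainers = new ConcurrentHashMap<>();

  public void containerLaunched(String appAttemptId, C container) {
    liveContainers.computeIfAbsent(appAttemptId, k -> new CopyOnWriteArrayList<>())
        .add(container);
  }

  // Equivalent of getTransferredContainers: no scheduler-wide lock is taken.
  public List<C> getTransferredContainers(String appAttemptId) {
    List<C> containers = liveContainers.get(appAttemptId);
    return containers == null ? Collections.emptyList() : new ArrayList<>(containers);
  }
}
{code}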
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500308#comment-14500308 ] Hadoop QA commented on YARN-3410: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726185/0003-YARN-3410.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7378//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7378//console This message is automatically generated. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500398#comment-14500398 ] Sangjin Lee commented on YARN-3431: --- I know [~zjshen]'s updating the patch, but I'll provide some feedback based on the current patch and the discussion here. Generally I agree with the approach of using fields in TimelineEntity to store/retrieve specialized information. That would definitely help with JSON's (lack of) support for polymorphism. With regard to the parent-child relationship and relationships in general, this might be a bit of a change, but would it be better to have some kind of key or label for a relationship? It would help locate a particular relationship (e.g. parent) quickly, and help other use cases identify exactly the relationship they need to retrieve. Thoughts? On a related note, I have problems with prohibiting hierarchical timeline entities from having any relationships other than parent-child. For example, frameworks (e.g. mapreduce) may use hierarchical timeline entities to describe their hierarchy (job -> task -> task attempts), and these entities would have dotted lines to YARN system entities (app, containers, etc.) and vice versa. It would be a pretty severe restriction to prohibit them. If we adopt the above approach, we should be able to allow both, right? (FlowEntity.java) - l. 58: do we want to set the id once we calculate it from scratch? (TimelineEntity.java) - l.88: Some javadoc would be helpful in explaining this constructor. It doesn't come through as very obvious. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclasses that inherit from it. However, we only have a single endpoint, which consumes TimelineEntity rather than its sub-classes, and this endpoint checks that the incoming request body contains exactly a TimelineEntity object. However, the json data serialized from a sub-class object does not seem to be treated as a TimelineEntity object, and won't be deserialized into the corresponding sub-class object, which causes a deserialization failure, as discussed in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
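The "key or label for a relationship" suggestion above amounts to something like the structure sketched below; it is only an illustration of the shape of the proposal, not the actual TimelineEntity API or the patch under review.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of labelled entity relationships, e.g. "PARENT" -> {"flow_run_42"}.
public class KeyedRelationsSketch {

  private final Map<String, Set<String>> relatedEntities = new HashMap<>();

  public void addRelation(String label, String entityId) {
    relatedEntities.computeIfAbsent(label, k -> new HashSet<>()).add(entityId);
  }

  // Locating a particular relationship (e.g. the parent) becomes a direct lookup by label.
  public Set<String> getRelated(String label) {
    return relatedEntities.getOrDefault(label, new HashSet<>());
  }
}
{code}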
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500395#comment-14500395 ] Sunil G commented on YARN-3482: --- Hi [~elgoiri] bq. better to report the resources utilized by the machine. Do you mean Total CPU, and Total Memory etc. Could you please elaborate how this can help in doing a better resource allotment. As I see, if affinity is not set in CPU, distribution will be more generic and it may not be so easy to derive from that. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500405#comment-14500405 ] zhihai xu commented on YARN-3491: - I uploaded a new patch, YARN-3491.001.patch, for review. Thinking about it a little more deeply, the old patch may have a big delay if multiple containers are submitted at the same time. For example, the following log shows 4 containers submitted at very close times: {code} 2015-04-07 21:42:22,071 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110648_01_078264 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,074 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110652_01_093777 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,076 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110668_01_049049 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110668_01_085183 transitioned from NEW to LOCALIZING {code} The new patch can overlap the delay with the public localization from the previous container, which will be a little better and more consistent with the behavior of the old code. It will also be better for a container which only has private resources and no public resources; in this case, no delay will be added to the Dispatcher thread. Finally, the change in the new patch is a little smaller than in the first patch. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
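The structure of the fix being discussed — keep the slow getInitializedLocalDirs/checkLocalDir work off the dispatcher thread by pushing it into the localization thread pool along with the download itself — can be pictured roughly as below. This is a simplified sketch under assumed names, not the code in YARN-3491.001.patch.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: enqueue the slow per-resource work so it overlaps across resources.
public class PublicLocalizerSketch {

  // Hypothetical stand-in for a public resource and its localization steps.
  interface Resource {
    void ensureLocalDirsInitialized(); // the ~10 ms-per-dir checkLocalDir work
    void download();
  }

  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  // Called from the dispatcher thread: only enqueues, so the dispatcher is never blocked.
  public void addResource(Resource rsrc) {
    pool.submit(() -> {
      rsrc.ensureLocalDirsInitialized();
      rsrc.download();
    });
  }
}
{code}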
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500417#comment-14500417 ] Hadoop QA commented on YARN-2003: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726222/0007-YARN-2003.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7382//console This message is automatically generated. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500338#comment-14500338 ] Wangda Tan commented on YARN-3410: -- bq. Yes, with the same user two RMs cannot be started. It checks for the PID and fails. YARN-2268 disallows formatting the state store while the RM is running. The same verification can be made for this also in that JIRA. Yes we should, it's the same problem. The latest patch LGTM, +1. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, 0004-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500342#comment-14500342 ] Wangda Tan commented on YARN-3487: -- Thanks for feedback from [~sunilg], [~jlowe]. Make this as a sub JIRA of YARN-3091, and w/r lock for CS is tracked by YARN-3139. The latest patch LGTM, will commit when Jenkins get back. CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch, YARN-3487.003.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500348#comment-14500348 ] Inigo Goiri commented on YARN-3482: --- To make it match yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb, I'm calling it yarn.nodemanager.resource.dynamic-availability. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500223#comment-14500223 ] Inigo Goiri commented on YARN-3482: --- Makes sense. I think we should implement both and give the option to use one or the other. Proposal for the names of the variables? yarn.nodemanager.track-utilization.node=true/false yarn.nodemanager.track-utilization.containers=true/false yarn.nodemanager.resource=true/false (The second one would be for YARN-3481.) For the interface, the simplest thing is to edit yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb in yarn-site.xml. However, this implies modifying the XML periodically which is kind of dirty for this purpose. I guess the cleanest is using the admin interface, preferences? Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
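Whichever property name wins out, the NM side would read it through the usual Hadoop Configuration API, roughly as sketched below. The property key and default are assumptions taken from this discussion, not settled configuration names.

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of reading the proposed flag; the key below is still under discussion on this JIRA.
public class DynamicResourceFlagSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default to false so existing clusters keep today's static NM resource sizing.
    boolean dynamic =
        conf.getBoolean("yarn.nodemanager.resource.dynamic-availability", false);
    System.out.println("Dynamic resource availability enabled: " + dynamic);
  }
}
{code}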
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500298#comment-14500298 ] Rohith commented on YARN-3410: -- bq. I think the RM will check the PID at startup to avoid this case, correct? Yes, two RMs cannot be started as the same user; the RM checks for the PID and fails. YARN-2268 disallows formatting the state store while the RM is running. The same verification can be added for this case as well in that JIRA. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3487: - Issue Type: Sub-task (was: Bug) Parent: YARN-3091 CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch, YARN-3487.003.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.001.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms. The total delay will be approximately the number of local dirs * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
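The arithmetic in the description above is the key point: with, say, four local dirs at 10+ ms each, every public resource pays roughly 40+ ms on the dispatcher thread before the download is even handed to the thread pool. Purely as a hedged illustration of the caching idea the description implies, and not the attached patch, here is a sketch that remembers which local dirs have already passed the expensive check; all class and method names are hypothetical.
{code}
// Hedged sketch only; hypothetical names, not the actual YARN-3491 change.
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class InitializedLocalDirsCache {
  // Local dirs that have already passed the expensive verification.
  private final Set<String> verifiedDirs =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  /** Stands in for the ~10+ ms per-dir check described in the profiling above. */
  private void checkLocalDir(String dir) {
    // permission/ownership checks on the local dir would go here
  }

  /** Pays the checkLocalDir cost only the first time each dir is seen. */
  public void ensureInitialized(Iterable<String> localDirs) {
    for (String dir : localDirs) {
      if (verifiedDirs.add(dir)) { // add() returns true only for unseen dirs
        checkLocalDir(dir);
      }
    }
  }
}
{code}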
[jira] [Updated] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated YARN-3021: Release Note: ResourceManager renews delegation tokens for applications. This behavior has been changed to renew tokens only if the token's renewer is a non-empty string. MapReduce jobs can instruct ResourceManager to skip renewal of tokens obtained from certain hosts by specifying the hosts with configuration mapreduce.job.hdfs-servers.token-renewal.exclude=host1,host2,..,hostN. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
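Since the release note above names a concrete configuration key, here is a hedged example of how a MapReduce job could set it using the standard Job/Configuration API. The host name is a placeholder and the job setup is trimmed to the relevant line; this illustrates the documented key, it is not code from the patch.
{code}
// Illustration only: the host name is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TokenRenewalExcludeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask the RM to skip renewal of delegation tokens obtained from this host.
    conf.set("mapreduce.job.hdfs-servers.token-renewal.exclude",
        "nn.realm-b.example.com");
    Job job = Job.getInstance(conf, "cross-realm copy");
    // ... configure mapper, input/output paths, etc., then submit, e.g.:
    // job.waitForCompletion(true);
  }
}
{code}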
[jira] [Commented] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS
[ https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500037#comment-14500037 ] Junping Du commented on YARN-3046: -- Thanks [~rkanter] and [~zjshen] for the review and comments! bq. One minor thing: Is there a JIRA for this TODO? Yes, YARN-3367. Will add the JIRA number to the TODO comment here. bq. Task entity Id should not be the job Id, but the task Id. PS: there's a typo here Nice catch! This is definitely a bug. Fixed it (and the typo) in the v3 patch. [Event producers] Implement MapReduce AM writing some MR metrics to ATS --- Key: YARN-3046 URL: https://issues.apache.org/jira/browse/YARN-3046 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Junping Du Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes written) and have the MR AM write the framework-specific metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500062#comment-14500062 ] Sangjin Lee commented on YARN-3390: --- Thanks Zhijie. I'll move forward with the existing patch for YARN-3437. You can still make the change of String -> ApplicationId as part of this JIRA (as it involves more refactoring). How's that sound? Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500268#comment-14500268 ] Wangda Tan commented on YARN-3410: -- One question: what will happen if a running app is removed from the state store while the RM is running? Will it leave the state corrupted? I think the RM will check the PID at startup to avoid this case, correct? I also deployed a local cluster to try this, and everything works fine. One minor comment about the usage: {code} Usage: java ResourceManager [-format-state-store] | [-remove-application-from-state-store ApplicationId] {code} Better to format it like this? {code} Usage: yarn resourcemanager [-format-state-store] [-remove..] appId {code} YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
[ https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500307#comment-14500307 ] Xuan Gong commented on YARN-2605: - Thanks for the review, [~ste...@apache.org]. bq. Why is a test now tagged as @ignore? The test case does not work at all once we make the changes; it gives a "too many redirect loops" exception. [RM HA] Rest api endpoints doing redirect incorrectly - Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong Assignee: Xuan Gong Labels: newbie Attachments: YARN-2605.1.patch The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or return a well-defined error message (json or xml) stating that the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3487: - Attachment: YARN-3487.003.patch Thanks for the feedback, Wangda and Sunil. In the interest of keeping this JIRA simple to expedite the getQueueInfo and getQueue fix, this version of the patch restores the lock on checkAccess. IIRC there's already another JIRA proposing to add read/write locks to the CapacityScheduler to handle rare events like queue config refresh. CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch, YARN-3487.003.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
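For readers unfamiliar with the read/write-lock idea mentioned above (tracked separately, per the earlier comments about YARN-3139), here is a hedged, self-contained sketch of the pattern rather than the YARN-3487 patch itself: read-mostly calls such as getQueueInfo take a shared read lock, while rare operations like a queue configuration refresh take the exclusive write lock, so readers no longer serialize behind one scheduler-wide monitor. All names below are hypothetical.
{code}
// Hedged sketch of the read/write-lock pattern; not actual Hadoop code.
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueueInfoLockingSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private String queueSnapshot = "root"; // stands in for queue state

  /** Read-mostly path: many concurrent readers, no exclusive monitor. */
  public String getQueueInfo() {
    lock.readLock().lock();
    try {
      return queueSnapshot;
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Rare path, e.g. a queue config refresh: takes the exclusive write lock. */
  public void refreshQueues(String newSnapshot) {
    lock.writeLock().lock();
    try {
      queueSnapshot = newSnapshot;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}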
[jira] [Updated] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2003: -- Attachment: 0007-YARN-2003.patch Rebased the patch and addressed the comments. Thank you [~leftnoteasy] Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500368#comment-14500368 ] Yongjun Zhang commented on YARN-3021: - Thanks also to [~ka...@cloudera.com] for the earlier discussions; we worked out a release note, which I just updated. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500029#comment-14500029 ] Yongjun Zhang commented on YARN-3021: - Thanks again [~jianhe] for the reviews/suggestions and committing! Thanks [~qwertymaniac] for diagnosing and reporting the issue, Harsh, [~vinodkv], [~adhoot] for the reviews and discussions! YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500152#comment-14500152 ] Zhijie Shen commented on YARN-3390: --- bq. How's that sound? Sure, I'll take care of it. Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500184#comment-14500184 ] Wangda Tan commented on YARN-3463: -- [~cwelch], Thanks for update, patch LGTM, +1. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1402) Related Web UI, CLI changes on exposing client API to check log aggregation status
[ https://issues.apache.org/jira/browse/YARN-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500193#comment-14500193 ] Junping Du commented on YARN-1402: -- bq. This is a good point. The reports also need to be used in generating the RMAppLogAggregationWebUI. So, we can not simply delete them. Agree that we cannot simply delete them. Will file one to start discussion on solutions. +1. Will commit the latest patch shortly. Related Web UI, CLI changes on exposing client API to check log aggregation status -- Key: YARN-1402 URL: https://issues.apache.org/jira/browse/YARN-1402 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1402.1.patch, YARN-1402.2.patch, YARN-1402.3.1.patch, YARN-1402.3.2.patch, YARN-1402.3.patch, YARN-1402.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS
[ https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3046: - Attachment: YARN-3046-v3.patch Incorporated [~zjshen]'s and [~rkanter]'s comments in the v3 patch! Also identified an NPE issue in the previous patch for MiniMRYarnCluster (when the auxiliary service is not set explicitly). Verified that the related tests pass. [Event producers] Implement MapReduce AM writing some MR metrics to ATS --- Key: YARN-3046 URL: https://issues.apache.org/jira/browse/YARN-3046 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Junping Du Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch, YARN-3046-v3.patch Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes written) and have the MR AM write the framework-specific metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: (was: 00011-YARN-3136.patch) getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500097#comment-14500097 ] Karthik Kambatla commented on YARN-3482: bq. With this and the containers utilization, we can estimate the utilization of external processes. True, but I fear that will be too conservative. If we go that route, HBase RegionServers could grow aggressively and adversely affect resources under YARN. By having an interface for available resources, we ensure YARN aggressively schedules work to claim all available resources. Changing these available resources could be done through a secure interface that admins or a white-list of processes can access. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: 00011-YARN-3136.patch Checking jenkins again getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500187#comment-14500187 ] Jonathan Eagles commented on YARN-3437: --- Now that I have dug into timeline server performance (YARN-3448), I have a better understanding of what types of writes are costly. For example, a single entity will generate dozens of writes to the database. The number of primary keys, the number of related entities, and the write batch size (entities per put) greatly affect the time an entity put takes. While this is a good start, I think there should at least be a follow-up that addresses these issues to better measure the write performance. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500112#comment-14500112 ] Zhijie Shen commented on YARN-3437: --- Will take a look today. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500044#comment-14500044 ] Hudson commented on YARN-3021: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2116 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2116/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-yarn-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms. The total delay will be approximately the number of local dirs * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. was: Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow, at about 10 ms. The total delay will be approximately the number of local dirs * 10 ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms. The total delay will be approximately the number of local dirs * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500200#comment-14500200 ] Wangda Tan commented on YARN-2696: -- Failed test is tracked by YARN-2483 Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch, YARN-2696.3.patch, YARN-2696.4.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler. The parent queue will choose child queues by the used resource from smallest to largest. Now we support node label in CapacityScheduler, we should also consider used resource in child queues by node labels when allocating resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3410: - Attachment: 0003-YARN-3410.patch YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3500) Optimize ResourceManager Web loading speed
[ https://issues.apache.org/jira/browse/YARN-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3500: --- Priority: Major (was: Minor) Optimize ResourceManager Web loading speed -- Key: YARN-3500 URL: https://issues.apache.org/jira/browse/YARN-3500 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi After running 10k jobs, the ResourceManager web UI becomes slow to load. Since the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue moves the paging logic to the server side, so that loading will be fast. Loading 10k jobs costs 55 sec; loading 2k costs 7 sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
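Purely as a hedged illustration of the server-side paging idea described above, and not the ResourceManager web framework code, the sketch below slices the full job list on the server so only one page of rows is serialized and rendered per request; the class and method names are hypothetical. With the numbers from the description, returning a page of, say, 50 rows instead of all 10k should bring the load time much closer to the small-list case.
{code}
// Hedged, generic illustration of server-side paging; hypothetical names.
import java.util.List;

public class ServerSidePagingSketch {
  /** Returns only the rows for the requested page instead of every entry. */
  public static <T> List<T> page(List<T> allRows, int pageIndex, int pageSize) {
    int from = Math.min(pageIndex * pageSize, allRows.size());
    int to = Math.min(from + pageSize, allRows.size());
    return allRows.subList(from, to); // a view over the backing list
  }
}
{code}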
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500481#comment-14500481 ] Hadoop QA commented on YARN-3136: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726204/00012-YARN-3136.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7379//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7379//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7379//console This message is automatically generated. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)