[jira] [Commented] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
[ https://issues.apache.org/jira/browse/YARN-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612909#comment-14612909 ] Xuan Gong commented on YARN-3882: - +1 LGTM. Pending Jenkins > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. > --- > > Key: YARN-3882 > URL: https://issues.apache.org/jira/browse/YARN-3882 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-3882.000.patch > > > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in > {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. > {{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them > after using them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
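For illustration, a minimal sketch of the kind of fix the description calls for (the loop details are assumptions modeled on AggregatedLogFormat.LogReader, not necessarily the attached patch): the scanner is closed in a finally block on every exit path, since {{TFile.Reader.Scanner}} implements {{java.io.Closeable}}.

{code:java}
// Sketch only; assumes a LogReader-style "reader" field and the
// APPLICATION_OWNER LogKey constant from AggregatedLogFormat.
TFile.Reader.Scanner ownerScanner = reader.createScanner();
try {
  LogKey key = new LogKey();
  while (!ownerScanner.atEnd()) {
    TFile.Reader.Scanner.Entry entry = ownerScanner.entry();
    key.readFields(entry.getKeyStream());
    if (key.toString().equals(APPLICATION_OWNER.toString())) {
      return entry.getValueStream().readUTF();
    }
    ownerScanner.advance();
  }
  return null;
} finally {
  ownerScanner.close(); // released even if readFields/readUTF throws
}
{code}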
[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612872#comment-14612872 ] Devaraj K commented on YARN-3883: - Please go ahead [~brahmareddy]. > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K >Assignee: Brahma Reddy Battula > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612865#comment-14612865 ] Brahma Reddy Battula commented on YARN-3883: [~devaraj.k], I would like to work on this. If you have already started working on it, feel free to reassign. Thanks. > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K >Assignee: Brahma Reddy Battula > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-3883: -- Assignee: Brahma Reddy Battula > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K >Assignee: Brahma Reddy Battula > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612857#comment-14612857 ] Devaraj K commented on YARN-3883: - It is occurring for this reason: while creating the ApplicationReport as part of ClientRMService.getApplicationReport(), the report's YarnApplicationState is set to FINISHED even when the RMAppState is still FINISHING. Diagnostics is not yet available while the RMAppState is FINISHING; it is only set on RMAppImpl during the AppFinishedTransition, when the app moves to the FINISHED state. > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
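For context, a simplified sketch of the state mapping being described (in Hadoop 2.7 the real mapping lives in RMServerUtils.createApplicationState; other states are elided here, so this is not the exact source):

{code:java}
// Simplified sketch; intentional fall-through from FINISHING to FINISHED.
public static YarnApplicationState createApplicationState(RMAppState rmAppState) {
  switch (rmAppState) {
    case FINISHING:
      // FINISHING is internal-only and is already reported to clients as
      // FINISHED, but RMAppImpl only sets diagnostics later, in
      // AppFinishedTransition. A report taken in this window therefore
      // shows FINISHED with "N/A" diagnostics.
    case FINISHED:
      return YarnApplicationState.FINISHED;
    default:
      throw new YarnRuntimeException("Unknown state passed: " + rmAppState);
  }
}
{code}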
[jira] [Created] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
Devaraj K created YARN-3883: --- Summary: YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state Key: YARN-3883 URL: https://issues.apache.org/jira/browse/YARN-3883 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Devaraj K YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state. Below is the report from YarnClient.getApplicationReport(); it doesn't show the diagnostics for an application whose FinalStatus is FAILED and YarnApplicationState is FINISHED. {code:xml} 15/07/03 15:53:27 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: XX.XXX.XX.XX ApplicationMaster RPC port: 0 queue: default start time: 1435918986890 final status: FAILED tracking URL: http://stobdtserver2:8088/proxy/application_1435848120635_0015/ user: root {code} But we can see the Diagnostics information in the RM Web UI for the same application. {code:xml} YarnApplicationState: FINISHED Queue: default FinalStatus Reported by AM: FAILED Started: Fri Jul 03 15:53:06 +0530 2015 Elapsed: 20sec Tracking URL: History Log Aggregation Status: DISABLED Diagnostics: User class threw exception: java.lang.NumberFormatException: For input string: "xx" {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612805#comment-14612805 ] Dian Fu commented on YARN-2923: --- {quote}But I would also like to get inputs from other folks in the open source community on exposing this interface on the RM side... maybe based on that I would move it into hadoop-yarn-server-common.{quote} Yes, of course. > Support configuration based NodeLabelsProvider Service in Distributed Node > Label Configuration Setup > - > > Key: YARN-2923 > URL: https://issues.apache.org/jira/browse/YARN-2923 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Fix For: 2.8.0 > > Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, > YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, > YARN-2923.20150517-1.patch > > > As part of distributed node label configuration, we need to support node > labels being configured in yarn-site.xml. On modification of the node label > configuration in yarn-site.xml, the NM should be able to get the modified node > labels from this NodeLabelsProvider service without an NM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612798#comment-14612798 ] Sangjin Lee commented on YARN-3815: --- {quote} I think we don't have to do it at the container level, but it is also not necessary for the AM to retain and aggregate these values. The AM could help forward the values to the per-app timeline collector but doesn't have to aggregate them. Vinod has more ideas on this from the offline discussion. [~vinodkv], can you comment on this? {quote} Interesting. Could you or [~vinodkv] shed light on the idea? It would still need to be captured in an entity or entities, right? I would think sending it as part of the container entities would be simpler and more consistent (in that the per-app collector can simply look at all container metrics as subject to aggregation). I'd love to hear more about this. {quote} I think "per-container averages" are not the same as per-container resource usage. Understanding an application's real resource consumption/usage has been one of the core use cases for the new timeline service from the beginning, so I don't think we should rule out anything important here. {quote} How is the per-container resource usage different from the per-container average described in the summary? Could you kindly provide its definition? No doubt understanding applications' real resource consumption/usage is critical. Between the individual container resource usage (which is all captured), the aggregated resource usage at the app/flow level (which the basic real-time aggregation addresses), and the running averages/max of the aggregated resource usage at the app/flow level, I think that need is definitely covered. What would be the gap that's not addressed by the above data? > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-313: - Attachment: YARN-313-v5.patch Fixed the checkstyle issues (the ones that I could fix and that made sense). Fixed one unit test (the other one, in refreshNodes, I have no idea why it breaks). > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612729#comment-14612729 ] Devaraj K commented on YARN-3878: - Thanks [~varun_saxena] for the patch and [~jianhe] for the review. There are a few minor comments on the patch; you can address these before getting the patch in. * I think the AsyncDispatcher.isDrained() method can be removed from AsyncDispatcher now, and eventQueue.isEmpty() can be verified directly in the tests. * In TestAsyncDispatcher, can you remove this JIRA number comment and add a comment describing what the test does? {code:xml} /* Test to verify fix for YARN-3878 */ {code} * In TestAsyncDispatcher, please use *disp.close()* instead of disp.stop(). > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as follows: > # The RM is stopped while putting an RMStateStore event on RMStateStore's > AsyncDispatcher. This leads to an InterruptedException being thrown. > # As the RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for the > event queue to drain (as the RM state store dispatcher is configured to drain > the queue on stop). > # This condition never becomes true and the AsyncDispatcher keeps waiting > incessantly for the dispatcher event queue to drain until the JVM exits. > *Initial exception while posting an RM state store event to the queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.jav
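For illustration, a rough sketch of the shape those review comments suggest for the test (hypothetical code, not the actual TestAsyncDispatcher): a descriptive name and comment instead of the JIRA number, {{close()}} instead of {{stop()}}, and the event queue checked directly rather than via {{isDrained()}}.

{code:java}
// Hypothetical sketch only; the real test lives in TestAsyncDispatcher.
@Test(timeout = 10000)
public void testDispatcherDrainsEventQueueOnStop() throws Exception {
  // Verifies the dispatcher does not hang on stop when configured to
  // drain events, even if the handler thread was interrupted.
  BlockingQueue<Event> eventQueue = new LinkedBlockingQueue<Event>();
  AsyncDispatcher disp = new AsyncDispatcher(eventQueue);
  disp.init(new Configuration());
  disp.setDrainEventsOnStop();
  disp.start();
  // ... enqueue events and interrupt the handler as in the bug scenario ...
  disp.close();                             // close() rather than stop()
  Assert.assertTrue(eventQueue.isEmpty());  // check the queue directly
}
{code}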
[jira] [Updated] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
[ https://issues.apache.org/jira/browse/YARN-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3882: Attachment: YARN-3882.000.patch > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. > --- > > Key: YARN-3882 > URL: https://issues.apache.org/jira/browse/YARN-3882 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-3882.000.patch > > > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in > {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. > {{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them > after using them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
zhihai xu created YARN-3882: --- Summary: AggregatedLogFormat should close aclScanner and ownerScanner after creating them. Key: YARN-3882 URL: https://issues.apache.org/jira/browse/YARN-3882 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor AggregatedLogFormat should close aclScanner and ownerScanner after creating them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. {{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them after using them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612706#comment-14612706 ] Naganarasimha G R commented on YARN-2923: - Thanks [~dian.fu] for the review. But I would also like to get inputs from other folks in the open source community on exposing this interface on the RM side... maybe based on that I would move it into {{hadoop-yarn-server-common}}. [~leftnoteasy], it's been a long time since we revisited the distributed node labeling JIRAs; can you please check and review once more ... > Support configuration based NodeLabelsProvider Service in Distributed Node > Label Configuration Setup > - > > Key: YARN-2923 > URL: https://issues.apache.org/jira/browse/YARN-2923 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Fix For: 2.8.0 > > Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, > YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, > YARN-2923.20150517-1.patch > > > As part of distributed node label configuration, we need to support node > labels being configured in yarn-site.xml. On modification of the node label > configuration in yarn-site.xml, the NM should be able to get the modified node > labels from this NodeLabelsProvider service without an NM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612702#comment-14612702 ] Inigo Goiri commented on YARN-313: -- Up to you, [~djp]; you did the work. I'm just trying to keep it up to date with trunk. > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612687#comment-14612687 ] Junping Du commented on YARN-3815: -- Thanks [~sjlee0] for the comments! bq. I think it is pretty natural and straightforward for AMs to aggregate and retain values at the app level, but even if they set it at the container level, it could work. I would rather say it was "natural" before timeline service v2 came along. :) I think we don't have to do it at the container level, but it is also not necessary for the AM to retain and aggregate these values. The AM could help forward the values to the per-app timeline collector but doesn't have to aggregate them. Vinod has more ideas on this from the offline discussion. [~vinodkv], can you comment on this? bq. Note that we're not proposing to keep the average as a time series. So I'm not sure if that is feasible. If not, we may consider changing the proposal to support a time series, given the data volume is not too large here. bq. We also ruled out per-container averages (explained in the summary), so per-task resource usage is not an example we're looking for. I think "per-container averages" are not the same as per-container resource usage. Understanding an application's real resource consumption/usage has been one of the core use cases for the new timeline service from the beginning, so I don't think we should rule out anything important here. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612684#comment-14612684 ] Sangjin Lee commented on YARN-3815: --- {quote} The use case here should be obvious. A quick real-life example is Google's Borg cluster management tool (http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf), which aggregates per-task resource usage information for usage-based charging, job debugging, and long-term capacity planning. {quote} Thanks [~djp]. What I'm looking for are slightly more specific examples. That's why we spent some time during the discussion to define precisely what we mean by "averages". We discovered that there were already two different definitions of the average for gauges. We also ruled out per-container averages (explained in the summary), so per-task resource usage is not an example we're looking for. So as for the moving (but aggregate) average, are there other examples? What we discussed during the meeting (also in the summary) was the total CPU utilization of an app/flow. Are there other examples, and how might they be useful, or is that pretty much the best example? > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612680#comment-14612680 ] Sangjin Lee commented on YARN-3815: --- bq. This approach sounds very clever. In addition, if we need resource consumption at any point or over a time window (t1, t2), we can simply compute Avg(t2) * t2 - Avg(t1) * t1. This is much better than aggregating values at each point at query time. Note that we're not proposing to keep the average as a *time series*. So I'm not sure if that is feasible. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612678#comment-14612678 ] Sangjin Lee commented on YARN-3815: --- {quote} We may consider providing two ways here: - For legacy applications like MR, the AM has already done aggregation on these counters itself. - For new applications built against YARN after timeline service v2, the AM can delegate aggregation to the YARN timeline service instead of doing it itself. Our data model and aggregation mechanism should ensure the YARN timeline service can aggregate these framework-specific metrics without them being predefined. {quote} I think it's a little more complicated than that. If a new YARN application wants to delegate aggregation to the YARN timeline service, it still needs to do at least the following: - add the framework-specific metrics to the YARN container - do *not* add any of those metrics to the YARN application The framework-specific metrics set on the containers would still be transmitted by the AM (not by the node managers). Then, the YARN timeline service could look at *any* container metrics and apply the uniform aggregation rules. Hopefully YARN apps can add metric values to container entities (there should be a natural mapping from units of work to containers); otherwise it won't work for them... I think it is pretty natural and straightforward for AMs to aggregate and retain values at the app level, but even if they set it at the container level, it could work. On the other hand, if your app wants to own aggregation, then it should not set the metrics on the containers, or it would be done twice. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
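For illustration, a hypothetical sketch of the pattern described above, using the v2 entity model under discussion (the exact class and method names here are assumptions, not the branch's settled API): the AM attaches the framework metric to the container entity and leaves the application entity alone, so the per-app collector can aggregate all container metrics uniformly.

{code:java}
// Hypothetical sketch; names are assumptions based on the v2 design.
TimelineEntity containerEntity = new TimelineEntity();
containerEntity.setType("YARN_CONTAINER");
containerEntity.setId(containerId.toString());

TimelineMetric mapOutputRecords = new TimelineMetric();
mapOutputRecords.setId("MAP_OUTPUT_RECORDS");
mapOutputRecords.addValue(System.currentTimeMillis(), 48231L);
containerEntity.addMetric(mapOutputRecords);

// Sent by the AM (not the NM) to the per-app collector; no copy of the
// metric is put on the application entity, to avoid double aggregation.
timelineClient.putEntities(containerEntity);
{code}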
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612651#comment-14612651 ] Anubhav Dhoot commented on YARN-433: LGTM > When RM is catching up with node updates then it should not expire acquired > containers > -- > > Key: YARN-433 > URL: https://issues.apache.org/jira/browse/YARN-433 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Xuan Gong > Attachments: YARN-433.1.patch, YARN-433.2.patch > > > The RM expires containers that are not launched within some time of being > allocated. The default is 10 minutes. When the RM is not keeping up with node > updates, it may not be aware of newly launched containers. If the expiry > thread fires for such containers, the RM can expire them even though they > may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612648#comment-14612648 ] Junping Du commented on YARN-3815: -- bq. Also, it would be GREAT if you could give a clear and compelling use case (a real-life example) showing why such support would be crucial. Thanks! The use case here should be obvious. A quick real-life example is Google's Borg cluster management tool (http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf), which aggregates per-task resource usage information for usage-based charging, job debugging, and long-term capacity planning. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3866) AM-RM protocol changes to support container resizing
[ https://issues.apache.org/jira/browse/YARN-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-3866: Attachment: YARN-3866.2.patch Attached a new patch based on the review comments. 1) Moved most of the PB record test cases to {{TestPBImplRecords}} and deleted unnecessary test files. 2) Fixed the JavaDoc annotations and improved/added comments on all public methods; checked the generated Javadocs to make sure they look OK. 3) Fixed the indentation problem. 4) Added {{ContainerStatus}}, {{ContainerStatusPBImpl}} I will fix JavaDoc and indentation issues in the other tickets as well. Thanks [~leftnoteasy] for the review. > AM-RM protocol changes to support container resizing > > > Key: YARN-3866 > URL: https://issues.apache.org/jira/browse/YARN-3866 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-3866.1.patch, YARN-3866.2.patch > > > YARN-1447 and YARN-1448 are outdated. > This ticket deals with AM-RM protocol changes to support container resizing > according to the latest design in YARN-1197. > 1) Add increase/decrease requests to AllocateRequest > 2) Get approved increase/decrease requests from the RM in AllocateResponse > 3) Add relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
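For illustration, roughly the record shape the ticket describes (names here are assumptions based on the YARN-1197 design discussion, not necessarily the attached patch): the AM asks for a resize in AllocateRequest and reads approved changes back from AllocateResponse on a later heartbeat.

{code:java}
// Hypothetical sketch; record/method names are assumptions.
ContainerResourceChangeRequest increase =
    ContainerResourceChangeRequest.newInstance(
        containerId, Resource.newInstance(4096 /* MB */, 2 /* vcores */));
allocateRequest.setIncreaseRequests(Collections.singletonList(increase));

AllocateResponse response = amRMProtocol.allocate(allocateRequest);
for (Container increased : response.getIncreasedContainers()) {
  // Approved increases come back here on a subsequent heartbeat;
  // approved decreases would arrive via getDecreasedContainers().
}
{code}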
[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612638#comment-14612638 ] Zhijie Shen commented on YARN-3881: --- Once the metrics are ready, we can build a built-in YARN/timeline service web UI to show this information, as well as expose it via an API, so that third-party monitoring tools like Ambari can integrate with it. I think it should be quite flexible. > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > The RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single > application, these belong to the RM or are cluster-wide. Therefore, the > current write path is not going to work for these metrics because they don't > have the associated user/flow/app context info. We need to rethink modeling > cross-app metrics and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612629#comment-14612629 ] Sangjin Lee commented on YARN-3815: --- For gauges and their averages and max in particular, [~vinodkv], [~gtCarrera9], [~djp], could you please confirm that what I captured in that document is exactly what we want to support, and comment on it? Also, it would be *GREAT* if you could give a clear and compelling use case (a real-life example) showing why such support would be crucial. Thanks! > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612620#comment-14612620 ] Junping Du commented on YARN-3815: -- bq. app-level aggregation for framework-specific metrics will be done by the AM. I think there is a little misunderstanding here: as I mentioned above, AMs should/could be relieved from aggregating counters themselves after timeline service v2. Legacy AMs could still push aggregated counters to the backend storage, though. Others who were also in the room, any comments here? > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612592#comment-14612592 ] Sangjin Lee commented on YARN-3815: --- Here is my take on what's consensus, what's not, and what's currently out of scope. I may have misread the discussion and your impression/understanding may be different, so please feel free to chime in and comment on this! (consensus or not controversial) - applications table will be split from the main entities table - app-level aggregation for framework-specific metrics will be done by the AM - app-level aggregation for YARN-system container metrics will be done by the per-app timeline collector - real-time aggregation does a simple sum for all types of metrics - metrics API will be updated to differentiate gauges and counters (the type information will need to be persisted in the storage) - for gauges, in addition to the simple sum-based aggregation, support average and max - the flow-run table will be created to handle app-to-flow-run ("real-time") aggregation as proposed in the native HBase schema design - auxiliary tables will be implemented as proposed in the native HBase schema design - time-based aggregation (daily, weekly, monthly, etc.) will be done via Phoenix tables to enable ad-hoc queries (questions remaining or undecided) - for the average/max support for gauges (see above), confirm that's exactly what we want to support - how to implement app-to-flow-run aggregation for gauges - how to perform the time-based aggregation (mapreduce, using co-processor endpoints, etc.) - how to handle long-running apps for time-based aggregation - consider adopting "null delimiters" (or other Phoenix-friendly tools) to support Phoenix reading data from the native HBase tables - using flow collectors, user collectors, and queue collectors as means of performing (higher-level) aggregation (out of scope) - support per-container averages for gauges - any aggregation other than time-based aggregation for flows, users, and queues - creating a dependency on the explicit YARN flow API > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612591#comment-14612591 ] Junping Du commented on YARN-3815: -- Thanks [~sjlee0] for the nice writeup of the discussions. Most parts look good to me. Some comments on app-level aggregations: bq. Framework‐specific metrics will be sent to the per‐app collector aggregated by the AM itself. We may consider providing two ways here: - For legacy applications like MR, the AM has already done aggregation on these counters itself. - For new applications built against YARN after timeline service v2, the AM can delegate aggregation to the YARN timeline service instead of doing it itself. Our data model and aggregation mechanism should ensure the YARN timeline service can aggregate these framework-specific metrics without them being predefined. bq. time average & max: the average multiplied by the elapsed time of the application represents the total resource usage over time. This approach sounds very clever. In addition, if we need resource consumption at any point or over a time window (t1, t2), we can simply compute Avg(t2) * t2 - Avg(t1) * t1. This is much better than aggregating values at each point at query time. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
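Spelled out, the identity being relied on above (assuming Avg(t) denotes the running time average of the metric measured from application start):

{noformat}
C(t)      = Avg(t) * t                      cumulative resource usage from app start to time t
U(t1, t2) = C(t2) - C(t1)
          = Avg(t2) * t2 - Avg(t1) * t1     usage within the window (t1, t2]
{noformat}

So storing only the running average (with timestamps) is enough to recover the total over any window whose endpoints were observed, without re-aggregating individual data points at query time.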
[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612553#comment-14612553 ] Junping Du commented on YARN-313: - Sorry for coming late on this. [~elgoiri], are you interested in taking over this JIRA and moving it forward? If so, I can assign it to you. > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-313: Labels: (was: BB2015-05-TBR) > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3445) Cache runningApps in RMNode for getting running apps on given NodeId
[ https://issues.apache.org/jira/browse/YARN-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612543#comment-14612543 ] Junping Du commented on YARN-3445: -- Thanks for the review and comments, [~mingma]! bq. That is around 10M entries. So it should be ok for RM. ApplicationId only contains an int (4 bytes) and a long (8 bytes) field. Even considering the Java object header, padding, and PB object overhead, each entry should be far less than 100 bytes, so 10M entries would stay under roughly 1 GB. I agree it should be fine even at the large scale in the mentioned scenario. bq. Do you need synchronizedList in the following list? It looks like the access of runningApplications is protected by RMNodeImpl's readLock and writeLock. Nice catch! I will replace the synchronizedList with an ArrayList and add some write locks (missing in the previous patch). > Cache runningApps in RMNode for getting running apps on given NodeId > > > Key: YARN-3445 > URL: https://issues.apache.org/jira/browse/YARN-3445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.0 >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-3445-v2.patch, YARN-3445-v3.1.patch, > YARN-3445-v3.patch, YARN-3445.patch > > > Per discussion in YARN-3334, we need to filter out unnecessary collector info > from the RM in the heartbeat response. Our proposal is to add a cache of > runningApps in RMNode, so the RM only sends back collectors for locally > running apps. This is also needed in YARN-914 (graceful decommission): if > there are no running apps on an NM that is in the decommissioning stage, it > will get decommissioned immediately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
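For illustration, a hypothetical sketch of the change being discussed (method names are invented here; only the readLock/writeLock convention follows RMNodeImpl, and this is not the actual patch):

{code:java}
// Plain ArrayList instead of Collections.synchronizedList; all access is
// guarded by RMNodeImpl's existing readLock/writeLock.
private final List<ApplicationId> runningApplications =
    new ArrayList<ApplicationId>();

public List<ApplicationId> getRunningApps() {
  this.readLock.lock();
  try {
    // Defensive copy so callers never see a partially updated list.
    return new ArrayList<ApplicationId>(this.runningApplications);
  } finally {
    this.readLock.unlock();
  }
}

private void addRunningApp(ApplicationId appId) {
  this.writeLock.lock();
  try {
    this.runningApplications.add(appId);
  } finally {
    this.writeLock.unlock();
  }
}
{code}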
[jira] [Updated] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3815: -- Attachment: hbase-schema-proposal-for-aggregation.pdf aggregation-design-discussion.pdf > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612529#comment-14612529 ] Sangjin Lee commented on YARN-3815: --- Some of us ([~gtCarrera9], [~vinodkv], [~djp], [~zjshen], [~vrushalic], and [~sjlee0]) had a face-to-face design discussion on aggregation. I am going to post the summary of that discussion along with a proposal for an expanded native HBase schema to support aggregation. I believe we are much closer to a consensus on the aggregation design, but some important questions still remain. For the sake of public discussion and inviting more participants and comments, we should follow up here on this JIRA. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612511#comment-14612511 ] Jian He commented on YARN-3878: --- ah, sorry, I overlooked. lgtm, thanks ! > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher 
hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) >
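The hang described in this issue reduces to a drained flag that is never set once the handler thread dies on an interrupt while the queue is still non-empty. A stripped-down sketch of the problematic pattern (illustrative names, not the actual AsyncDispatcher code):
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Minimal sketch of the drain-on-stop hang; not the real dispatcher. */
public class MiniDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;
  private volatile boolean drained = false;

  private final Thread handler = new Thread(() -> {
    while (!stopped) {
      drained = queue.isEmpty();
      try {
        queue.take().run();
      } catch (InterruptedException e) {
        return; // handler dies here; 'drained' may stay false forever
      }
    }
  });

  public void start() { handler.start(); }

  public synchronized void stop() throws InterruptedException {
    // If the handler already exited on interrupt with events still queued,
    // 'drained' never becomes true and this wait loop spins until JVM exit.
    while (!drained) {
      wait(100);
    }
    stopped = true;
    handler.interrupt();
  }
}
{code}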
[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612432#comment-14612432 ] Lei Guo commented on YARN-3881: --- This is an interesting topic. Assuming the timeline server provides this support, should Ambari or other monitoring tools use this for monitoring purposes? If not, what's the scenario for writing RM-related metrics? > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single application, > these belong to the RM or are cluster-wide. Therefore, the current write path is > not going to work for these metrics because they don't have the associated > user/flow/app context info. We need to rethink the modeling of cross-app metrics > and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612429#comment-14612429 ] Zhijie Shen commented on YARN-3881: --- IMHO, we need to add an additional API to directly write the cross-app metrics (or already-aggregated metrics, if you think of these as the aggregated data of the individual apps, such as the counters of submitted/pending/running apps) to the backend, in separate tables, such as cluster/queue/user tables; these data don't need to be aggregated any more. > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single application, > these belong to the RM or are cluster-wide. Therefore, the current write path is > not going to work for these metrics because they don't have the associated > user/flow/app context info. We need to rethink the modeling of cross-app metrics > and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
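If such an additional API were added, it might take a shape like the following hypothetical writer interface; the method names, per-table split, and signatures are assumptions sketched for discussion, not part of any posted patch.
{code}
/**
 * Hypothetical writer API for cross-app / pre-aggregated metrics that carry
 * no user/flow/app context; each method targets its own backing table and
 * the written values need no further aggregation.
 */
public interface AggregatedMetricsWriter {
  /** Write a cluster-wide metric (e.g. running apps) to a cluster table. */
  void writeClusterMetric(String clusterId, String metricName,
      long timestamp, Number value);

  /** Write a queue-level metric (e.g. pending containers) to a queue table. */
  void writeQueueMetric(String clusterId, String queueName,
      String metricName, long timestamp, Number value);

  /** Write a user-level metric (e.g. apps submitted) to a user table. */
  void writeUserMetric(String clusterId, String user,
      String metricName, long timestamp, Number value);
}
{code}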
[jira] [Updated] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3881: -- Attachment: metrics.json > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single application, > these belong to the RM or are cluster-wide. Therefore, the current write path is > not going to work for these metrics because they don't have the associated > user/flow/app context info. We need to rethink the modeling of cross-app metrics > and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3881) Writing RM cluster-level metrics
Zhijie Shen created YARN-3881: - Summary: Writing RM cluster-level metrics Key: YARN-3881 URL: https://issues.apache.org/jira/browse/YARN-3881 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen RM has a bunch of metrics that we may want to write into the timeline backend. I attached the metrics.json that I've crawled via {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to three groups of metrics: 1. QueueMetrics 2. JvmMetrics 3. ClusterMetrics The problem is that unlike other metrics, which belong to a single application, these belong to the RM or are cluster-wide. Therefore, the current write path is not going to work for these metrics because they don't have the associated user/flow/app context info. We need to rethink the modeling of cross-app metrics and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
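For reference, a metrics dump like the attached metrics.json can be pulled from the RM's JMX servlet with nothing but the JDK; a minimal sketch of that crawl, assuming the RM web address used in the description:
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Fetches the RM JMX dump (QueueMetrics, JvmMetrics, ClusterMetrics, ...). */
public class JmxCrawl {
  public static void main(String[] args) throws Exception {
    // Same query as in the description; adjust host/port for your cluster.
    URL url = new URL("http://localhost:8088/jmx?qry=Hadoop:*");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // raw JSON, one big document
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}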
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612342#comment-14612342 ] Varun Saxena commented on YARN-3878: [~jianhe], the test case as such is adequate. I have basically added two assert statements. If the first statement is true and the second is false, a hang will occur. But as the assert statements are there, the test will fail before the hang occurs. Without the core changes of the patch, the test will fail at the second assertion point. But if you remove this second assertion point, the hang will occur and the test case will time out. {code} Assert.assertTrue("Event Queue should have been empty", eventQueue.isEmpty()); Assert.assertTrue("Async Dispatcher should have been drained as event " + "queue is empty", disp.isDrained()); {code} So do you want me to remove this second assertion statement so that the test case doesn't fail before the hang (without the core changes)? Let me know. > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. 
> *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$C
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612320#comment-14612320 ] Jian He commented on YARN-3878: --- Hi [~varun_saxena], the test seems not adequate. It doesn't prove the AsyncDispatcher will hang in this case. Could you update the test case to simulate this scenario so that it actually hangs without the core changes of the patch? > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb
[jira] [Commented] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612307#comment-14612307 ] Varun Saxena commented on YARN-3047: Uploaded a new patch. [~sjlee0], [~zjshen], kindly review > [Data Serving] Set up ATS reader with basic request serving structure and > lifecycle > --- > > Key: YARN-3047 > URL: https://issues.apache.org/jira/browse/YARN-3047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Labels: BB2015-05-TBR > Attachments: Timeline_Reader(draft).pdf, > YARN-3047-YARN-2928.08.patch, YARN-3047-YARN-2928.09.patch, > YARN-3047-YARN-2928.10.patch, YARN-3047.001.patch, YARN-3047.003.patch, > YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, > YARN-3047.02.patch, YARN-3047.04.patch > > > Per design in YARN-2938, set up the ATS reader as a service and implement the > basic structure as a service. It includes lifecycle management, request > serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
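A minimal sketch of the kind of service skeleton under review here, assuming Hadoop's AbstractService lifecycle; the class name and the responsibilities named in the comments are illustrative, not the contents of the actual patch.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

/** Illustrative skeleton of an ATS reader service; not the posted patch. */
public class TimelineReaderSkeleton extends AbstractService {

  public TimelineReaderSkeleton() {
    super(TimelineReaderSkeleton.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Read reader-specific config: bind address, backing storage impl, etc.
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // Start the embedded web server that serves REST read requests.
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // Stop the web server and release backing storage reader resources.
    super.serviceStop();
  }
}
{code}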
[jira] [Updated] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3047: --- Attachment: YARN-3047-YARN-2928.10.patch > [Data Serving] Set up ATS reader with basic request serving structure and > lifecycle > --- > > Key: YARN-3047 > URL: https://issues.apache.org/jira/browse/YARN-3047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Labels: BB2015-05-TBR > Attachments: Timeline_Reader(draft).pdf, > YARN-3047-YARN-2928.08.patch, YARN-3047-YARN-2928.09.patch, > YARN-3047-YARN-2928.10.patch, YARN-3047.001.patch, YARN-3047.003.patch, > YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, > YARN-3047.02.patch, YARN-3047.04.patch > > > Per design in YARN-2938, set up the ATS reader as a service and implement the > basic structure as a service. It includes lifecycle management, request > serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3880) Writing more RM side app-level metrics
Zhijie Shen created YARN-3880: - Summary: Writing more RM side app-level metrics Key: YARN-3880 URL: https://issues.apache.org/jira/browse/YARN-3880 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen In YARN-3044, we implemented an analog of the metrics publisher for ATS v1. While it helps to write app/attempt/container life-cycle events, it doesn't write many of the app-level system metrics that the RM now has. Here are the metrics that I found missing: * runningContainers * memorySeconds * vcoreSeconds * preemptedResourceMB * preemptedResourceVCores * numNonAMContainerPreempted * numAMContainerPreempted Please feel free to add more to the list if you find something that's not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
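Among the listed metrics, memorySeconds and vcoreSeconds are resource-time integrals that the RM accumulates while containers run; a minimal sketch of that bookkeeping, with field and method names chosen here for illustration:
{code}
/** Illustrative accumulator for app-level resource-seconds metrics. */
public class AppResourceUsage {
  private long memorySeconds; // MB held by containers, integrated over time
  private long vcoreSeconds;  // vcores held by containers, integrated over time

  /** Called periodically (or on container finish) with the elapsed interval. */
  public void accumulate(int allocatedMB, int allocatedVcores,
      long intervalSeconds) {
    memorySeconds += (long) allocatedMB * intervalSeconds;
    vcoreSeconds += (long) allocatedVcores * intervalSeconds;
  }

  public long getMemorySeconds() { return memorySeconds; }
  public long getVcoreSeconds() { return vcoreSeconds; }
}
{code}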
[jira] [Updated] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3849: -- Attachment: 0004-YARN-3849.patch Yes [~leftnoteasy], you are correct, thanks for pointing that out. I updated the patch. :) > Too much preemption activity causing continuous killing of containers > across queues > - > > Key: YARN-3849 > URL: https://issues.apache.org/jira/browse/YARN-3849 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.0 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, > 0003-YARN-3849.patch, 0004-YARN-3849.patch > > > Two queues are used. Each queue is given a capacity of 0.5. Dominant > Resource policy is used. > 1. An app is submitted in QueueA which is consuming full cluster capacity > 2. After submitting an app in QueueB, there is some demand, invoking > preemption in QueueA > 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that > all containers other than the AM are getting killed in QueueA > 4. Now the app in QueueB is trying to take over the cluster with the current free > space. But there is some updated demand from the app in QueueA, which lost > its containers earlier, and preemption is now kicked off in QueueB. > The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the > apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612187#comment-14612187 ] Wangda Tan commented on YARN-3849: -- [~sunilg], Thanks for the update, but testPreemptionWithVCoreResource has a similar issue. {code} {"100:100", "10:100", "0"}, // used {code} Could you fix it as well? > Too much preemption activity causing continuous killing of containers > across queues > - > > Key: YARN-3849 > URL: https://issues.apache.org/jira/browse/YARN-3849 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.0 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, > 0003-YARN-3849.patch > > > Two queues are used. Each queue is given a capacity of 0.5. Dominant > Resource policy is used. > 1. An app is submitted in QueueA which is consuming full cluster capacity > 2. After submitting an app in QueueB, there is some demand, invoking > preemption in QueueA > 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that > all containers other than the AM are getting killed in QueueA > 4. Now the app in QueueB is trying to take over the cluster with the current free > space. But there is some updated demand from the app in QueueA, which lost > its containers earlier, and preemption is now kicked off in QueueB. > The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the > apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612159#comment-14612159 ] Sunil G commented on YARN-2004: --- Thank you [~jianhe] for the comments. - bq. Or this method has more responsibility than that? Yes. We are planning to check for ACLs (priority ACLs) in this method. I was planning to handle that in a separate ticket. {noformat} yarn.scheduler.capacity.root...acl=user1,user2 {noformat} This config will be at queue level, and we could restrict certain users to certain high priorities. So only certain users can use a high priority, and others won't be able to submit applications at that priority. This ACL check was planned to be added in {{authenticateApplicationPriority}}. - bq. we may merge the two into a single patch? I will merge these patches together and upload them to YARN-2003. > Priority scheduling support in Capacity scheduler > - > > Key: YARN-2004 > URL: https://issues.apache.org/jira/browse/YARN-2004 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, > 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, > 0006-YARN-2004.patch, 0007-YARN-2004.patch, 0008-YARN-2004.patch, > 0009-YARN-2004.patch, 0010-YARN-2004.patch > > > Based on the priority of the application, the Capacity Scheduler should be able > to give preference to applications while scheduling. > The comparator applicationComparator can be changed as below: > > 1. Check for application priority. If priority is available, then return > the highest-priority job. > 2. Otherwise continue with the existing logic, such as App ID comparison and > then timestamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
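To make the idea concrete, a queue-level priority ACL might eventually be configured along these lines; the full property name, the priority field in it, and the user list below are hypothetical, since the exact key and syntax were not finalized in this discussion.
{noformat}
# hypothetical capacity-scheduler configuration entry (syntax not finalized):
yarn.scheduler.capacity.root.default.priority.5.acl=user1,user2
# intent: only user1 and user2 may submit applications at priority 5 to
# root.default; submissions at that priority from other users are rejected.
{noformat}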
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612117#comment-14612117 ] Varun Saxena commented on YARN-3051: Ok...Will make the change > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051-YARN-2928.05.patch, YARN-3051-YARN-2928.06.patch, > YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, > YARN-3051.Reader_API_2.patch, YARN-3051.Reader_API_3.patch, > YARN-3051.Reader_API_4.patch, YARN-3051.wip.02.YARN-2928.patch, > YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612113#comment-14612113 ] Zhijie Shen commented on YARN-3051: --- 2. I meant we store it in a CSV file. Thoughts? 3. I think FS-impl-related config shouldn't be put in the API, as the impl is not supposed to be used publicly, but only for test purposes. > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051-YARN-2928.05.patch, YARN-3051-YARN-2928.06.patch, > YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, > YARN-3051.Reader_API_2.patch, YARN-3051.Reader_API_3.patch, > YARN-3051.Reader_API_4.patch, YARN-3051.wip.02.YARN-2928.patch, > YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
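A test-only file-system store along the lines suggested above could be as small as the following sketch; the file name, column layout, and class name are assumptions for illustration, not the actual patch.
{code}
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Illustrative CSV-backed entity store for a test-only FS implementation. */
public class CsvEntityStore {
  private final Path file;

  public CsvEntityStore(String dir) {
    // One flat file for simplicity; a real impl might shard per app or type.
    this.file = Paths.get(dir, "entities.csv");
  }

  /** Appends one entity row: type,id,createdTime. */
  public void append(String entityType, String entityId, long createdTime)
      throws IOException {
    try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
      w.write(entityType + "," + entityId + "," + createdTime + "\n");
    }
  }
}
{code}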
[jira] [Updated] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3849: -- Attachment: 0003-YARN-3849.patch Thank you [~leftnoteasy] for the comments. Uploading a patch addressing the issues. Regarding one comment: bq. testPreemptionWithVCoreResource seems not correct, root.used != A.used + b.used {noformat} "root(=[100:200 100:200 100:200 100:200],x=[100:200 100:200 100:200 100:200]);" "-a(=[50:100 100:200 20:40 50:100],x=[50:100 100:200 80:160 50:100]);" + // a "-b(=[50:100 100:200 80:160 50:100],x=[50:100 100:200 20:40 50:100])"; {noformat} Here, now root.used = a.used + b.used. Please help check. > Too much preemption activity causing continuous killing of containers > across queues > - > > Key: YARN-3849 > URL: https://issues.apache.org/jira/browse/YARN-3849 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.0 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, > 0003-YARN-3849.patch > > > Two queues are used. Each queue is given a capacity of 0.5. Dominant > Resource policy is used. > 1. An app is submitted in QueueA which is consuming full cluster capacity > 2. After submitting an app in QueueB, there is some demand, invoking > preemption in QueueA > 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that > all containers other than the AM are getting killed in QueueA > 4. Now the app in QueueB is trying to take over the cluster with the current free > space. But there is some updated demand from the app in QueueA, which lost > its containers earlier, and preemption is now kicked off in QueueB. > The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the > apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
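Reading each bracketed tuple as [guaranteed, max, used, pending] (an assumption about the test's column convention), the used column now adds up for both the default partition and label x:
{noformat}
default partition:  a.used + b.used = 20:40  + 80:160 = 100:200 = root.used
label x:            a.used + b.used = 80:160 + 20:40  = 100:200 = root.used
{noformat}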
[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3877: --- Attachment: YARN-3877.01.patch > YarnClientImpl.submitApplication swallows exceptions > > > Key: YARN-3877 > URL: https://issues.apache.org/jira/browse/YARN-3877 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Affects Versions: 2.7.2 >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Minor > Attachments: YARN-3877.01.patch > > > When {{YarnClientImpl.submitApplication}} spins waiting for the application > to be accepted, any interruption during its sleep() calls is logged and > swallowed. > This makes it hard to interrupt the thread during shutdown. Really, it should > throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
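The usual remedy for such a polling loop is to stop swallowing the interrupt: restore the thread's interrupt status and surface an exception to the caller. A minimal sketch of that pattern (the loop body and the exception choice are illustrative, not YarnClientImpl's actual code):
{code}
import org.apache.hadoop.yarn.exceptions.YarnException;

/** Illustrative submit-and-wait loop that propagates interruption. */
public abstract class SubmitWaitSketch {
  public void waitUntilAccepted() throws YarnException {
    while (!isAccepted()) {
      try {
        Thread.sleep(200); // poll interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore interrupt status
        throw new YarnException(
            "Interrupted while waiting for the application to be accepted", e);
      }
    }
  }

  /** Placeholder for the real "has the app reached ACCEPTED" check. */
  protected abstract boolean isAccepted();
}
{code}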
[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3840: --- Attachment: YARN-3840-5.patch > Resource Manager web ui issue when sorting application by id (with > application having id > ) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: LINTE >Assignee: Mohammad Shahid Khan > Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, > YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > . > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3846) RM Web UI queue filter is not working
[ https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3846: --- Labels: PatchAvailable (was: ) Not adding any test case; the change is only in JS code. > RM Web UI queue filter is not working > - > > Key: YARN-3846 > URL: https://issues.apache.org/jira/browse/YARN-3846 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0, 2.8.0 >Reporter: Mohammad Shahid Khan >Assignee: Mohammad Shahid Khan > Labels: PatchAvailable > Attachments: YARN-3846.patch, scheduler queue issue.png, scheduler > queue positive behavior.png > > > Clicking on the root queue will show all applications, > but clicking on a leaf queue does not filter the applications related to the > clicked queue. > The regular expression seems to be wrong: > {code} > q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$';", > {code} > For example: > 1. Suppose the queue name is b. > Then the above expression will try to substr at index 1: > q.lastIndexOf(':') = -1 > -1 + 2 = 1 > which is wrong; it should look at index 0. > 2. If the queue name is ab.x, > then it will parse it to .x, > but it should be x. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
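The indexing problem is easy to reproduce with the same string operations; the RM UI code is JavaScript, but Java's lastIndexOf/substring behave identically here, so this sketch uses Java for consistency with the rest of this digest:
{code}
public class QueueFilterDemo {
  /** Buggy extraction: misbehaves when there is no ':' in the string. */
  static String buggy(String q) {
    return q.substring(q.lastIndexOf(':') + 2); // "b": index -1 + 2 = 1 -> ""
  }

  /** Fixed extraction: fall back to the whole string when ':' is absent. */
  static String fixed(String q) {
    int idx = q.lastIndexOf(':');
    return idx < 0 ? q : q.substring(idx + 2); // skip ": " after the label
  }

  public static void main(String[] args) {
    System.out.println(buggy("b")); // "" - wrong, the queue name is lost
    System.out.println(fixed("b")); // "b" - expected
  }
}
{code}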
[jira] [Updated] (YARN-3846) RM Web UI queue filter is not working
[ https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3846: --- Attachment: YARN-3846.patch Please review the attached patch > RM Web UI queue filter is not working > - > > Key: YARN-3846 > URL: https://issues.apache.org/jira/browse/YARN-3846 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0, 2.8.0 >Reporter: Mohammad Shahid Khan >Assignee: Mohammad Shahid Khan > Attachments: YARN-3846.patch, scheduler queue issue.png, scheduler > queue positive behavior.png > > > Clicking on the root queue will show all applications, > but clicking on a leaf queue does not filter the applications related to the > clicked queue. > The regular expression seems to be wrong: > {code} > q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$';", > {code} > For example: > 1. Suppose the queue name is b. > Then the above expression will try to substr at index 1: > q.lastIndexOf(':') = -1 > -1 + 2 = 1 > which is wrong; it should look at index 0. > 2. If the queue name is ab.x, > then it will parse it to .x, > but it should be x. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3840: --- Attachment: YARN-3840-4.patch Attached a patch with test cases > Resource Manager web ui issue when sorting application by id (with > application having id > ) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: LINTE >Assignee: Mohammad Shahid Khan > Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, > YARN-3840-3.patch, YARN-3840-4.patch > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > . > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3878: --- Attachment: YARN-3878.02.patch > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event 
handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.se
[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3878: --- Attachment: (was: YARN-3878.02.patch) > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" 
prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop
[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3878: --- Attachment: YARN-3878.02.patch [~jianhe], added a test case > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > 
{noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) > at > org.apache.hadoop
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: YARN-2681.patch > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Labels: BB2015-05-TBR > Fix For: 2.7.0 > > Attachments: HdfsTrafficControl_UML.png, Traffic Control Design.png, > YARN-2681.patch, YARN-2681.patch, YARN-2681.patch > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. HDFS reads can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be set on the container node > (HDFS read = incoming data for the container) > - Since the HDFS data node is handled by only one process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container-handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs in > order to enforce bandwidth of outgoing data > Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf > Implementation: http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
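Task 3 amounts to installing, on the datanode's egress interface, an HTB class plus a u32 filter per address:port pair; the sketch below shows how a node-side helper might shell out to tc. The device name, class ids, rate, and match direction are illustrative, and a root htb qdisc (handle 1:) is assumed to exist already.
{code}
import java.io.IOException;
import java.util.Arrays;

/** Illustrative helper installing a per-connection egress rate limit via tc. */
public class TcShaper {
  /** Limits egress traffic to dstIp:dstPort on 'dev' to 'rate' (e.g. "50mbit"). */
  public static void limit(String dev, String dstIp, int dstPort,
      String rate, String classId) throws IOException, InterruptedException {
    // HTB class that will carry this connection's traffic.
    run("tc", "class", "add", "dev", dev, "parent", "1:", "classid", classId,
        "htb", "rate", rate);
    // u32 filter steering the matching address:port pair into that class.
    run("tc", "filter", "add", "dev", dev, "protocol", "ip", "parent", "1:",
        "prio", "1", "u32",
        "match", "ip", "dst", dstIp + "/32",
        "match", "ip", "dport", String.valueOf(dstPort), "0xffff",
        "flowid", classId);
  }

  private static void run(String... cmd)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("Command failed: " + Arrays.toString(cmd));
    }
  }
}
{code}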
[jira] [Commented] (YARN-3508) Prevent processing preemption events on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611716#comment-14611716 ] Varun Saxena commented on YARN-3508: [~leftnoteasy], updated the patch for branch-2.7 > Prevent processing preemption events on the main RM dispatcher > -- > > Key: YARN-3508 > URL: https://issues.apache.org/jira/browse/YARN-3508 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-3508-branch-2.7.01.patch, YARN-3508.002.patch, > YARN-3508.01.patch, YARN-3508.03.patch, YARN-3508.04.patch, > YARN-3508.05.patch, YARN-3508.06.patch > > > We recently saw the RM for a large cluster lag far behind on the > AsyncDispatcher event queue. The AsyncDispatcher thread was consistently > blocked on the highly-contended CapacityScheduler lock trying to dispatch > preemption-related events for RMContainerPreemptEventDispatcher. Preemption > processing should occur on the scheduler event dispatcher thread or a > separate thread to avoid delaying the processing of other events in the > primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
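The remedy described in the summary is to take preemption events off the main dispatcher and handle them on their own thread; a minimal sketch of that routing pattern (class and thread names are illustrative, not the patch itself):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Illustrative router: preemption events run on a dedicated thread, so the
 * main RM dispatcher never blocks on the scheduler lock while dispatching.
 */
public class PreemptionEventRouter {
  private final ExecutorService preemptionThread =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "preemption-event-handler");
        t.setDaemon(true);
        return t;
      });

  /** Called from the main dispatcher thread; returns immediately. */
  public void handle(Runnable preemptionEvent) {
    preemptionThread.execute(preemptionEvent); // scheduler lock taken off-thread
  }

  public void stop() {
    preemptionThread.shutdownNow();
  }
}
{code}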
[jira] [Updated] (YARN-3508) Prevent processing preemption events on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3508: --- Attachment: YARN-3508-branch-2.7.01.patch > Prevent processing preemption events on the main RM dispatcher > -- > > Key: YARN-3508 > URL: https://issues.apache.org/jira/browse/YARN-3508 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-3508-branch-2.7.01.patch, YARN-3508.002.patch, > YARN-3508.01.patch, YARN-3508.03.patch, YARN-3508.04.patch, > YARN-3508.05.patch, YARN-3508.06.patch > > > We recently saw the RM for a large cluster lag far behind on the > AsyncDispatcher event queue. The AsyncDispatcher thread was consistently > blocked on the highly-contended CapacityScheduler lock trying to dispatch > preemption-related events for RMContainerPreemptEventDispatcher. Preemption > processing should occur on the scheduler event dispatcher thread or a > separate thread to avoid delaying the processing of other events in the > primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)