[jira] [Commented] (YARN-3508) Prevent processing preemption events on the main RM dispatcher

2015-07-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611716#comment-14611716
 ] 

Varun Saxena commented on YARN-3508:


[~leftnoteasy], updated patch for branch-2.7

 Prevent processing preemption events on the main RM dispatcher
 --

 Key: YARN-3508
 URL: https://issues.apache.org/jira/browse/YARN-3508
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
 Attachments: YARN-3508-branch-2.7.01.patch, YARN-3508.002.patch, 
 YARN-3508.01.patch, YARN-3508.03.patch, YARN-3508.04.patch, 
 YARN-3508.05.patch, YARN-3508.06.patch


 We recently saw the RM for a large cluster lag far behind on the 
 AsyncDispatcher event queue.  The AsyncDispatcher thread was consistently 
 blocked on the highly contended CapacityScheduler lock while trying to 
 dispatch preemption-related events to the RMContainerPreemptEventDispatcher.  
 Preemption processing should occur on the scheduler event dispatcher thread 
 or a separate thread to avoid delaying the processing of other events in the 
 primary dispatcher queue.
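 A minimal sketch of the direction suggested above: give preemption events 
 their own AsyncDispatcher so that contention on the CapacityScheduler lock 
 cannot stall the main RM event queue. The event type and handler names follow 
 the description, not necessarily the actual patch.
 {code}
 // Hedged sketch only; the wiring inside the RM may differ.
 void startPreemptionDispatcher(Configuration conf, CapacityScheduler scheduler) {
   AsyncDispatcher preemptionDispatcher = new AsyncDispatcher();
   preemptionDispatcher.register(ContainerPreemptEventType.class,
       new RMContainerPreemptEventDispatcher(scheduler));
   preemptionDispatcher.init(conf);
   preemptionDispatcher.start();
   // The preemption policy now posts to this dispatcher, so blocking on the
   // scheduler lock no longer delays unrelated events on the main queue.
 }
 {code}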



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3508) Prevent processing preemption events on the main RM dispatcher

2015-07-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3508:
---
Attachment: YARN-3508-branch-2.7.01.patch

 Prevent processing preemption events on the main RM dispatcher
 --

 Key: YARN-3508
 URL: https://issues.apache.org/jira/browse/YARN-3508
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
 Attachments: YARN-3508-branch-2.7.01.patch, YARN-3508.002.patch, 
 YARN-3508.01.patch, YARN-3508.03.patch, YARN-3508.04.patch, 
 YARN-3508.05.patch, YARN-3508.06.patch


 We recently saw the RM for a large cluster lag far behind on the 
 AsyncDispatcher event queue.  The AsyncDispatcher thread was consistently 
 blocked on the highly contended CapacityScheduler lock while trying to 
 dispatch preemption-related events to the RMContainerPreemptEventDispatcher.  
 Preemption processing should occur on the scheduler event dispatcher thread 
 or a separate thread to avoid delaying the processing of other events in the 
 primary dispatcher queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2015-07-02 Thread cntic (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cntic updated YARN-2681:

Attachment: YARN-2681.patch

 Support bandwidth enforcement for containers while reading from HDFS
 

 Key: YARN-2681
 URL: https://issues.apache.org/jira/browse/YARN-2681
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.5.1
 Environment: Linux
Reporter: cntic
  Labels: BB2015-05-TBR
 Fix For: 2.7.0

 Attachments: HdfsTrafficControl_UML.png, Traffic Control Design.png, 
 YARN-2681.patch, YARN-2681.patch, YARN-2681.patch


 To read/write data from HDFS on a data node, applications establish TCP/IP 
 connections with the datanode. HDFS reads can be controlled by configuring 
 the Linux Traffic Control (TC) subsystem on the data node to apply filters 
 to the appropriate connections.
 The current cgroups net_cls concept can be applied neither on the node where 
 the container is launched nor on the data node, since:
 -   TC handles outgoing bandwidth only, so it cannot be set on the container's 
 node (an HDFS read is incoming data for the container)
 -   Since the HDFS data node is handled by only one process, it is not 
 possible to use net_cls to separate connections from different containers to 
 the datanode.
 Tasks:
 1) Extend the Resource model to define a bandwidth enforcement rate
 2) Monitor TCP/IP connections established by the container's handling process 
 and its child processes
 3) Set Linux Traffic Control rules on the data node, based on address:port 
 pairs, in order to enforce the bandwidth of outgoing data
 Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf
 Implementation: http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl.pdf
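 As a rough illustration of task 3, the kind of tc(8) filter rule the design 
 describes could be driven from Java as below. This is a hedged sketch only: 
 the device name, qdisc handle, class id, and port-matching choice are 
 assumptions, and an HTB (or similar) qdisc with a rate-limited class is 
 assumed to exist already.
 {code}
 import java.io.IOException;
 
 public class TrafficShaperSketch {
     // Classify traffic of one datanode connection (matched by its source
     // port on the datanode side) into an existing rate-limited TC class.
     static void limitConnection(String device, int datanodePort, String classId)
             throws IOException, InterruptedException {
         Process p = new ProcessBuilder(
                 "tc", "filter", "add", "dev", device, "protocol", "ip",
                 "parent", "1:", "prio", "1", "u32",
                 "match", "ip", "sport", Integer.toString(datanodePort), "0xffff",
                 "flowid", classId)
             .inheritIO().start();
         if (p.waitFor() != 0) {
             throw new IOException("tc filter add failed for " + device);
         }
     }
 }
 {code}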



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3878:
---
Attachment: YARN-3878.02.patch

 AsyncDispatcher can hang while stopping if it is configured for draining 
 events on stop
 ---

 Key: YARN-3878
 URL: https://issues.apache.org/jira/browse/YARN-3878
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3878.01.patch, YARN-3878.02.patch


 The sequence of events is as follows:
 # RM is stopped while putting an RMStateStore event to RMStateStore's 
 AsyncDispatcher. This leads to an InterruptedException being thrown.
 # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
 {{serviceStop}}, we check whether all events have been drained and wait for 
 the event queue to drain (as the RM state store dispatcher is configured to 
 drain its queue on stop).
 # This condition never becomes true, and AsyncDispatcher keeps waiting for 
 the dispatcher event queue to drain until the JVM exits.
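 A distilled sketch of the hang pattern described above (not the actual 
 AsyncDispatcher code; names are simplified): the producer clears the drained 
 flag before an interruptible put(), the put is interrupted so no event ever 
 arrives, the consumer stays blocked in take() and never recomputes the flag, 
 and stop() waits forever.
 {code}
 import java.util.concurrent.BlockingQueue;
 import java.util.concurrent.LinkedBlockingQueue;
 
 class MiniDispatcher {
     private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
     private volatile boolean drained = true;
     private final Object waitForDrained = new Object();
 
     void handle(Runnable event) {
         drained = false;              // cleared optimistically before the put
         try {
             queue.put(event);         // interrupted here: event never enqueued
         } catch (InterruptedException e) {
             Thread.currentThread().interrupt();
         }
     }
 
     void run() {                      // consumer thread body
         while (!Thread.currentThread().isInterrupted()) {
             drained = queue.isEmpty();
             synchronized (waitForDrained) {
                 if (drained) {
                     waitForDrained.notify();
                 }
             }
             try {
                 queue.take().run();   // blocks forever: no event is coming
             } catch (InterruptedException e) {
                 return;
             }
         }
     }
 
     void stop() throws InterruptedException {
         synchronized (waitForDrained) {
             while (!drained) {        // stays false; nothing will reset it
                 waitForDrained.wait(1000);
             }
         }
     }
 }
 {code}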
 *Initial exception while posting RM State store event to queue*
 {noformat}
 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
 (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
 STOPPED
 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
 {noformat}
 *JStack of AsyncDispatcher hanging on stop*
 {noformat}
 AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e 
 waiting on condition [0x7fb9654e9000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000700b79250 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
 at 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
 at java.lang.Thread.run(Thread.java:744)
 main prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() 
 [0x7fb989851000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x000700b79430 (a java.lang.Object)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:156)
   - locked 0x000700b79430 (a java.lang.Object)
 {noformat}
 

[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions

2015-07-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3877:
---
Attachment: YARN-3877.01.patch

 YarnClientImpl.submitApplication swallows exceptions
 

 Key: YARN-3877
 URL: https://issues.apache.org/jira/browse/YARN-3877
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: client
Affects Versions: 2.7.2
Reporter: Steve Loughran
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3877.01.patch


 When {{YarnClientImpl.submitApplication}} spins waiting for the application 
 to be accepted, any interruption during its sleep() calls is logged and 
 swallowed.
 This makes it hard to interrupt the thread during shutdown. Really it should 
 throw some form of exception and let the caller deal with it.
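 A hedged sketch of the suggested behavior inside the accept-polling loop: 
 restore the interrupt status and surface an exception instead of swallowing 
 it. The interval variable name is illustrative.
 {code}
 // Sketch only, not the current YarnClientImpl code:
 try {
     Thread.sleep(submitPollIntervalMillis); // hypothetical interval variable
 } catch (InterruptedException e) {
     Thread.currentThread().interrupt();     // restore the interrupt status
     throw new YarnException(
         "Interrupted while waiting for application to be accepted", e);
 }
 {code}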



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id 9999)

2015-07-02 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan updated YARN-3840:
---
Attachment: YARN-3840-5.patch

 Resource Manager web ui issue when sorting application by id (with 
 application having id 9999)
 

 Key: YARN-3840
 URL: https://issues.apache.org/jira/browse/YARN-3840
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: LINTE
Assignee: Mohammad Shahid Khan
 Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
 YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch


 On the web UI, the global main view page 
 http://resourcemanager:8088/cluster/apps doesn't display applications over 
 9999.
 With the command line it works (# yarn application -list).
 Regards,
 Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3878:
---
Attachment: YARN-3878.02.patch

[~jianhe], added a test case

 AsyncDispatcher can hang while stopping if it is configured for draining 
 events on stop
 ---

 Key: YARN-3878
 URL: https://issues.apache.org/jira/browse/YARN-3878
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3878.01.patch, YARN-3878.02.patch


 The sequence of events is as follows:
 # RM is stopped while putting an RMStateStore event to RMStateStore's 
 AsyncDispatcher. This leads to an InterruptedException being thrown.
 # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
 {{serviceStop}}, we check whether all events have been drained and wait for 
 the event queue to drain (as the RM state store dispatcher is configured to 
 drain its queue on stop).
 # This condition never becomes true, and AsyncDispatcher keeps waiting for 
 the dispatcher event queue to drain until the JVM exits.
 *Initial exception while posting RM State store event to queue*
 {noformat}
 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
 (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
 STOPPED
 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
 {noformat}
 *JStack of AsyncDispatcher hanging on stop*
 {noformat}
 AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e 
 waiting on condition [0x7fb9654e9000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000700b79250 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
 at 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
 at java.lang.Thread.run(Thread.java:744)
 main prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() 
 [0x7fb989851000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x000700b79430 (a java.lang.Object)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:156)
   - locked 0x000700b79430 (a java.lang.Object)
 {noformat}

[jira] [Updated] (YARN-3846) RM Web UI queue filter is not working

2015-07-02 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan updated YARN-3846:
---
Attachment: YARN-3846.patch

Please review the attached patch.

 RM Web UI queue filter is not working
 -

 Key: YARN-3846
 URL: https://issues.apache.org/jira/browse/YARN-3846
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0, 2.8.0
Reporter: Mohammad Shahid Khan
Assignee: Mohammad Shahid Khan
 Attachments: YARN-3846.patch, scheduler queue issue.png, scheduler 
 queue positive behavior.png


 Clicking on the root queue shows all applications,
 but clicking on a leaf queue does not filter the applications belonging to 
 the clicked queue.
 The regular expression seems to be wrong:
 {code}
 q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$';
 {code}
 For example:
 1. Suppose the queue name is b.
 Then the above expression will substr at index 1, because
 q.lastIndexOf(':') = -1 and
 -1 + 2 = 1,
 which is wrong; it should look at index 0.
 2. If the queue name is ab.x,
 it will be parsed to .x,
 but it should be x.
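 The one-line fix belongs in the web UI JavaScript; as a mirrored sketch in 
 Java of the corrected extraction (assuming the two-character ": " separator 
 that the original "+ 2" implies), the missing-separator case must be guarded:
 {code}
 // Hedged mirror of the corrected logic; names are illustrative.
 static String leafQueueRegex(String q) {
     int idx = q.lastIndexOf(':');
     // If there is no ':' separator, use the whole string instead of
     // blindly skipping past index -1 + 2 = 1.
     String leaf = (idx == -1) ? q : q.substring(idx + 2); // skip ": "
     return "^" + leaf + "$";
 }
 {code}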



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id 9999)

2015-07-02 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan updated YARN-3840:
---
Attachment: YARN-3840-4.patch

Attached a patch with test cases.

 Resource Manager web ui issue when sorting application by id (with 
 application having id 9999)
 

 Key: YARN-3840
 URL: https://issues.apache.org/jira/browse/YARN-3840
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: LINTE
Assignee: Mohammad Shahid Khan
 Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
 YARN-3840-3.patch, YARN-3840-4.patch


 On the web UI, the global main view page 
 http://resourcemanager:8088/cluster/apps doesn't display applications over 
 9999.
 With the command line it works (# yarn application -list).
 Regards,
 Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3878:
---
Attachment: (was: YARN-3878.02.patch)

 AsyncDispatcher can hang while stopping if it is configured for draining 
 events on stop
 ---

 Key: YARN-3878
 URL: https://issues.apache.org/jira/browse/YARN-3878
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3878.01.patch


 The sequence of events is as follows:
 # RM is stopped while putting an RMStateStore event to RMStateStore's 
 AsyncDispatcher. This leads to an InterruptedException being thrown.
 # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
 {{serviceStop}}, we check whether all events have been drained and wait for 
 the event queue to drain (as the RM state store dispatcher is configured to 
 drain its queue on stop).
 # This condition never becomes true, and AsyncDispatcher keeps waiting for 
 the dispatcher event queue to drain until the JVM exits.
 *Initial exception while posting RM State store event to queue*
 {noformat}
 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
 (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
 STOPPED
 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
 {noformat}
 *JStack of AsyncDispatcher hanging on stop*
 {noformat}
 AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e 
 waiting on condition [0x7fb9654e9000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000700b79250 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
 at 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
 at java.lang.Thread.run(Thread.java:744)
 main prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() 
 [0x7fb989851000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x000700b79430 (a java.lang.Object)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:156)
   - locked 0x000700b79430 (a java.lang.Object)
 {noformat}
 

[jira] [Updated] (YARN-3846) RM Web UI queue filter is not working

2015-07-02 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan updated YARN-3846:
---
Labels: PatchAvailable  (was: )

Not adding any test case; the change is only in JS code.

 RM Web UI queue filter is not working
 -

 Key: YARN-3846
 URL: https://issues.apache.org/jira/browse/YARN-3846
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0, 2.8.0
Reporter: Mohammad Shahid Khan
Assignee: Mohammad Shahid Khan
  Labels: PatchAvailable
 Attachments: YARN-3846.patch, scheduler queue issue.png, scheduler 
 queue positive behavior.png


 Clicking on the root queue shows all applications,
 but clicking on a leaf queue does not filter the applications belonging to 
 the clicked queue.
 The regular expression seems to be wrong:
 {code}
 q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$';
 {code}
 For example:
 1. Suppose the queue name is b.
 Then the above expression will substr at index 1, because
 q.lastIndexOf(':') = -1 and
 -1 + 2 = 1,
 which is wrong; it should look at index 0.
 2. If the queue name is ab.x,
 it will be parsed to .x,
 but it should be x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues

2015-07-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3849:
--
Attachment: 0003-YARN-3849.patch

Thank you [~leftnoteasy] for the comments.
Uploading a patch addressing the issues.

Regarding one comment, 
bq.testPreemptionWithVCoreResource seems not correct, root.used != A.used + 
b.used

{noformat}
root(=[100:200 100:200 100:200 100:200],x=[100:200 100:200  100:200 100:200]);

   -a(=[50:100  100:200   20:40   50:100],x=[50:100  100:200  80:160 
50:100]); + // a
   -b(=[50:100  100:200   80:160  50:100],x=[50:100  100:200  20:40  
50:100]); 
{noformat}

Now root.used = a.used + b.used here. Please help check.

 Too much of preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical
 Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
 0003-YARN-3849.patch


 Two queues are used, each given a capacity of 0.5. The Dominant 
 Resource policy is used.
 1. An app is submitted to QueueA, which consumes the full cluster capacity.
 2. After an app is submitted to QueueB, there is some demand, which invokes 
 preemption in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
 observed that all containers other than the AM are killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is updated demand from the app in QueueA, which lost its 
 containers earlier, and preemption now kicks in on QueueB.
 The scenario in steps 3 and 4 keeps happening in a loop, so neither of the 
 apps completes.
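 For reference, a hedged sketch of the setup described above, expressed as 
 test-style CapacityScheduler configuration (two 50% queues under root, DRF as 
 the resource calculator; the preemption monitor itself must also be enabled 
 via yarn.resourcemanager.scheduler.monitor.enable):
 {code}
 // Sketch only; the queue names QueueA/QueueB follow the description.
 CapacitySchedulerConfiguration csConf = new CapacitySchedulerConfiguration();
 csConf.set(CapacitySchedulerConfiguration.RESOURCE_CALCULATOR_CLASS,
     DominantResourceCalculator.class.getName());
 csConf.setQueues(CapacitySchedulerConfiguration.ROOT,
     new String[] {"QueueA", "QueueB"});
 csConf.setCapacity(CapacitySchedulerConfiguration.ROOT + ".QueueA", 50f);
 csConf.setCapacity(CapacitySchedulerConfiguration.ROOT + ".QueueB", 50f);
 {code}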



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-02 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612511#comment-14612511
 ] 

Jian He commented on YARN-3878:
---

ah, sorry, I overlooked it. 
LGTM, thanks!

 AsyncDispatcher can hang while stopping if it is configured for draining 
 events on stop
 ---

 Key: YARN-3878
 URL: https://issues.apache.org/jira/browse/YARN-3878
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3878.01.patch, YARN-3878.02.patch


 The sequence of events is as follows:
 # RM is stopped while putting an RMStateStore event to RMStateStore's 
 AsyncDispatcher. This leads to an InterruptedException being thrown.
 # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
 {{serviceStop}}, we check whether all events have been drained and wait for 
 the event queue to drain (as the RM state store dispatcher is configured to 
 drain its queue on stop).
 # This condition never becomes true, and AsyncDispatcher keeps waiting for 
 the dispatcher event queue to drain until the JVM exits.
 *Initial exception while posting RM State store event to queue*
 {noformat}
 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
 (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
 STOPPED
 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
 {noformat}
 *JStack of AsyncDispatcher hanging on stop*
 {noformat}
 AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e 
 waiting on condition [0x7fb9654e9000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000700b79250 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
 at 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
 at java.lang.Thread.run(Thread.java:744)
 main prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() 
 [0x7fb989851000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x000700b79430 (a java.lang.Object)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:156)
   - locked 0x000700b79430 (a java.lang.Object)
 {noformat}

[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612678#comment-14612678
 ] 

Sangjin Lee commented on YARN-3815:
---

{quote}
We may consider providing two ways here:
- For legacy applications like MR, the AM has already done aggregation on 
these counters itself.
- For new applications built against YARN after timeline service v2, the AM 
can delegate aggregation to the YARN timeline service instead of doing it 
itself. Our data model and aggregation mechanism should ensure the YARN 
timeline service can aggregate these framework-specific metrics without 
having them predefined.
{quote}

I think it's a little more complicated than that. If a new YARN application 
wants to delegate aggregation to the YARN timeline service, it still needs to 
do at least the following:
- add the framework-specific metrics to the YARN container
- do *not* add any of those metrics to the YARN application

The framework-specific metrics set on the containers would still be transmitted 
by the AM (not by the node managers). Then, the YARN timeline service could 
look at *any* container metrics and apply the uniform aggregation rules.

Hopefully YARN apps can add metric values to container entities (there should 
be a natural mapping from unit of work to containers), otherwise it won't work 
for them...

I think it is pretty natural and straightforward for AMs to aggregate and 
retain values at the app level, but even if they set it at the container level, 
it could work.

On the other hand, if your app wants to own aggregation, then it should not set 
the metrics on the containers, or it would be done twice.
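Roughly, "set the framework-specific metrics on the container entity, not the 
application entity" could look like the following with the v2 data model under 
discussion. The class and method names here are assumptions, not a settled API.

{code}
// Hedged sketch; the entity type id and metric name are made up.
TimelineEntity containerEntity = new TimelineEntity();
containerEntity.setType("YARN_CONTAINER");           // assumed type id
containerEntity.setId(containerId.toString());

TimelineMetric inputRecords = new TimelineMetric();
inputRecords.setId("MAP_INPUT_RECORDS");             // framework-specific metric
inputRecords.addValue(System.currentTimeMillis(), 12345L);
containerEntity.addMetric(inputRecords);
// The AM writes this through its per-app collector and deliberately adds no
// copy to the application entity, so the service can aggregate all container
// metrics uniformly.
{code}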

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that the query for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version, 
 and flow
 - User level, expected return: aggregated stats for applications submitted 
 by a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application states are the basic building block for all other levels of 
 aggregation. We can provide flow/user/queue level aggregated statistics info 
 based on application states (a dedicated table for application states is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612687#comment-14612687
 ] 

Junping Du commented on YARN-3815:
--

Thanks [~sjlee0] for the comments!
bq. I think it is pretty natural and straightforward for AMs to aggregate and 
retain values at the app level, but even if they set it at the container level, 
it could work.
I would rather say it was natural before timeline service v2 came out. :) I 
think we don't have to put it at the container level, but it is also not 
necessary for the AM to retain and aggregate these values. The AM could 
forward the values to the per-app timeline collector without having to 
aggregate them. Vinod has more ideas on this from an offline discussion. 
[~vinodkv], can you comment on this?

bq. Note that we're not proposing to keep the average as a time series. So I'm 
not sure if that is feasible.
If not, we may consider changing the proposal to support time series, given 
that the data volume is not too large here.

bq. We also ruled out per-container averages (explained in the summary), so 
per-task resource usage is not an example we're looking for.
I think per-container averages are not the same as per-container resource 
usage. Understanding an application's real resource consumption/usage has 
been one of the core use cases for the new timeline service from the 
beginning, so I don't think we should rule out anything important here.

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that the query for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version, 
 and flow
 - User level, expected return: aggregated stats for applications submitted 
 by a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application states are the basic building block for all other levels of 
 aggregation. We can provide flow/user/queue level aggregated statistics info 
 based on application states (a dedicated table for application states is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics

2015-07-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612638#comment-14612638
 ] 

Zhijie Shen commented on YARN-3881:
---

Once the metrics are ready, we can build a built-in YARN/timeline service web 
UI to show this information, as well as expose it via an API, so that 
third-party monitoring tools like Ambari can integrate with it. I think it 
should be quite flexible.

 Writing RM cluster-level metrics
 

 Key: YARN-3881
 URL: https://issues.apache.org/jira/browse/YARN-3881
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: metrics.json


 RM has a bunch of metrics that we may want to write into the timeline 
 backend. I attached the metrics.json that I've crawled via 
 {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to 
 three groups of metrics:
 1. QueueMetrics
 2. JvmMetrics
 3. ClusterMetrics
 The problem is that unlike other metrics, which belong to a single 
 application, these belong to the RM or are cluster-wide. Therefore, the 
 current write path is not going to work for these metrics because they don't 
 have the associated user/flow/app context info. We need to rethink the 
 modeling of cross-app metrics and the API to handle them.
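 For reference, a minimal sketch of how the attached metrics.json can be 
 gathered, per the JMX URL in the description (a plain HTTP GET against the 
 RM's JMX servlet):
 {code}
 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.net.URL;
 import java.nio.charset.StandardCharsets;
 
 public class JmxCrawl {
     public static void main(String[] args) throws Exception {
         URL url = new URL("http://localhost:8088/jmx?qry=Hadoop:*");
         try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
             in.lines().forEach(System.out::println); // dump the metrics JSON
         }
     }
 }
 {code}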



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers

2015-07-02 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612651#comment-14612651
 ] 

Anubhav Dhoot commented on YARN-433:


LGTM

 When RM is catching up with node updates then it should not expire acquired 
 containers
 --

 Key: YARN-433
 URL: https://issues.apache.org/jira/browse/YARN-433
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Xuan Gong
 Attachments: YARN-433.1.patch, YARN-433.2.patch


 RM expires containers that are not launched within some time of being 
 allocated. The default is 10 minutes. When an RM is not keeping up with node 
 updates, it may not be aware of newly launched containers. If the expiry 
 thread fires for such containers, the RM can expire them even though they 
 may have launched.
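 The 10-minute window mentioned above is the configurable allocation-expiry 
 interval; a hedged sketch using the standard YarnConfiguration key (default 
 600000 ms):
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;
 
 public class ExpiryConfigSketch {
     static Configuration withLongerExpiry() {
         Configuration conf = new YarnConfiguration();
         // Lengthen the allocation-expiry window, e.g. while an RM is known
         // to lag on node updates (default is 600000 ms = 10 minutes).
         conf.setLong(YarnConfiguration.RM_CONTAINER_ALLOC_EXPIRY_INTERVAL_MS,
             20 * 60 * 1000L); // 20 minutes
         return conf;
     }
 }
 {code}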



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612798#comment-14612798
 ] 

Sangjin Lee commented on YARN-3815:
---

{quote}
We don't have to put it at the container level, I think, but it is also not 
necessary for the AM to retain and aggregate these values. The AM could 
forward the values to the per-app timeline collector without having to 
aggregate them. Vinod has more ideas on this from an offline discussion. 
[~vinodkv], can you comment on this?
{quote}

Interesting. Could you or [~vinodkv] shed light on the idea? It would still 
need to be captured in an entity or entities, right? I would think sending it 
as part of the container entities would be simpler and more consistent (in that 
the per-app collector can simply look at all container metrics as subject to 
aggregation). I'd love to hear more about this.

{quote}
I think per-container averages are not the same as per-container resource 
usage. Understanding an application's real resource consumption/usage has 
been one of the core use cases for the new timeline service from the 
beginning, so I don't think we should rule out anything important here.
{quote}

How is the per-container resource usage different from the per-container 
average described in the summary? Could you kindly provide its definition?

No doubt understanding applications' real resource consumption/usage is 
critical. Between the individual container resource usage (which is all 
captured), the aggregated resource usage at the app/flow level (which the 
basic real-time aggregation addresses), and the running averages/max of the 
aggregated resource usage at the app/flow level, I think it definitely covers 
that need. What would be the gap that's not addressed by the above data?

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that the query for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version, 
 and flow
 - User level, expected return: aggregated stats for applications submitted 
 by a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application states are the basic building block for all other levels of 
 aggregation. We can provide flow/user/queue level aggregated statistics info 
 based on application states (a dedicated table for application states is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup

2015-07-02 Thread Dian Fu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612805#comment-14612805
 ] 

Dian Fu commented on YARN-2923:
---

{quote}But I would also like to get inputs from other folks in open source on 
exposing this interface on the RM side... maybe based on that I would like to 
move it into hadoop-yarn-server-common.{quote}
Yes, of course.

 Support configuration based NodeLabelsProvider Service in Distributed Node 
 Label Configuration Setup 
 -

 Key: YARN-2923
 URL: https://issues.apache.org/jira/browse/YARN-2923
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, 
 YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, 
 YARN-2923.20150517-1.patch


 As part of distributed node label configuration, we need to support node 
 labels configured in yarn-site.xml. On modification of the node label 
 configuration in yarn-site.xml, the NM should be able to pick up the modified 
 node labels from this NodeLabelsProvider service without an NM restart.
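 As a hedged illustration of the idea only (the property names below are 
 hypothetical placeholders, not final configuration keys):
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;
 
 public class ConfigBasedLabelsSketch {
     static Configuration sketch() {
         Configuration conf = new YarnConfiguration();
         // Hypothetical keys: choose a config-based provider, declare labels.
         conf.set("yarn.nodemanager.node-labels.provider", "config");
         conf.set("yarn.nodemanager.node-labels.provider.configured-node-labels",
             "GPU,LARGE_MEM");
         return conf;
     }
 }
 {code}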



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612684#comment-14612684
 ] 

Sangjin Lee commented on YARN-3815:
---

{quote}
The use case here should be obvious. A quick real-life example is Google 
Borg, a cluster management tool 
(http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf),
which aggregates per-task resource usage information for usage-based charging, 
job debugging, and long-term capacity planning.
{quote}

Thanks [~djp]. What I'm looking for are slightly more specific examples. 
That's why we spent some time during the discussion defining precisely what 
we mean by averages. We discovered that there were already two different 
definitions of the average for gauges. We also ruled out per-container 
averages (explained in the summary), so per-task resource usage is not an 
example we're looking for.

So as for the moving (but aggregate) average, are there other examples? What 
we discussed during the meeting (also in the summary) was the total CPU 
utilization of an app/flow. Are there other examples, and how might they be 
useful, or is that pretty much the best example?

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that the query for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version, 
 and flow
 - User level, expected return: aggregated stats for applications submitted 
 by a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application states are the basic building block for all other levels of 
 aggregation. We can provide flow/user/queue level aggregated statistics info 
 based on application states (a dedicated table for application states is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-07-02 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612702#comment-14612702
 ] 

Inigo Goiri commented on YARN-313:
--

Up to you [~djp], you did the work.
I'm just trying to keep it up to date with trunk.

 Add Admin API for supporting node resource configuration in command line
 

 Key: YARN-313
 URL: https://issues.apache.org/jira/browse/YARN-313
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
 YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch


 We should provide an admin interface, e.g. {{yarn rmadmin -refreshResources}}, 
 to support changes to a node's resources specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612680#comment-14612680
 ] 

Sangjin Lee commented on YARN-3815:
---

bq. This way sounds very clever. In addition, if we need resource consumption 
over any time window (t1, t2), we can simply compute Avg(t2) * t2 - 
Avg(t1) * t1. This is much better than aggregating values at each point at 
query time.

Note that we're not proposing to keep the average as a *time series*. So I'm 
not sure if that is feasible.
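For reference, the identity quoted above works only if the stored average is 
the running average of cumulative usage since start, available at both 
endpoints, which is exactly the time-series requirement in question. A worked 
sketch (avgAt is a hypothetical lookup):

{code}
// Hedged sketch: avgAt(t) returns the running average of usage over [0, t].
// Cumulative usage at t is then avgAt(t) * t. For example, avgAt(100s) = 60
// and avgAt(40s) = 50 give 6000 - 2000 = 4000 units in the window (40s, 100s].
double windowUsage(double t1, double t2) {
    return avgAt(t2) * t2 - avgAt(t1) * t1;
}
{code}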

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that the query for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version, 
 and flow
 - User level, expected return: aggregated stats for applications submitted 
 by a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application states are the basic building block for all other levels of 
 aggregation. We can provide flow/user/queue level aggregated statistics info 
 based on application states (a dedicated table for application states is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.

2015-07-02 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3882:

Attachment: YARN-3882.000.patch

 AggregatedLogFormat should close aclScanner and ownerScanner after creating 
 them.
 ---

 Key: YARN-3882
 URL: https://issues.apache.org/jira/browse/YARN-3882
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-3882.000.patch


 AggregatedLogFormat should close aclScanner and ownerScanner after creating 
 them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in 
 {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. 
 {{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them 
 after using them.
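 A minimal sketch of the suggested fix, using try-with-resources (assuming a 
 {{TFile.Reader}} named reader; the exact createScanner call used by 
 AggregatedLogFormat may differ):
 {code}
 // The scanner is Closeable per the description, so try-with-resources
 // guarantees it is closed even on error paths.
 try (TFile.Reader.Scanner scanner = reader.createScanner()) {
     // read the APPLICATION_ACL / APPLICATION_OWNER entry here
 }
 {code}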



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3883) YarnClient.getApplicationReport() doesn't give diagnostics for FINISHED state applications sometimes

2015-07-02 Thread Devaraj K (JIRA)
Devaraj K created YARN-3883:
---

 Summary: YarnClient.getApplicationReport() doesn't give 
diagnostics for FINISHED state applications sometimes 
 Key: YARN-3883
 URL: https://issues.apache.org/jira/browse/YARN-3883
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Devaraj K


YarnClient.getApplicationReport() doesn't give diagnostics for FINISHED 
state applications sometimes.

Below is the report from YarnClient.getApplicationReport(). It doesn't 
show the diagnostics for an application which has FinalStatus as FAILED and 
YarnApplicationState as FINISHED.
{code:xml}
15/07/03 15:53:27 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: XX.XXX.XX.XX
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1435918986890
 final status: FAILED
 tracking URL: 
http://stobdtserver2:8088/proxy/application_1435848120635_0015/
 user: root
{code}


But we can see the Diagnostics information in the RM Web UI for the same 
application.
{code:xml}
YarnApplicationState: FINISHED
Queue: default
FinalStatus Reported by AM: FAILED
Started: Fri Jul 03 15:53:06 +0530 2015
Elapsed: 20sec
Tracking URL: History
Log Aggregation Status: DISABLED
Diagnostics: User class threw exception: java.lang.NumberFormatException: 
For input string: xx
{code}
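For reference, a minimal sketch of the client-side call that produces the 
report above (standard YarnClient API; the application id is a placeholder):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class ReportSketch {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();
        ApplicationId appId =
            ConverterUtils.toApplicationId("application_1435848120635_0015");
        ApplicationReport report = client.getApplicationReport(appId);
        // In the failing case this prints "N/A" even though the web UI shows
        // the NumberFormatException diagnostics.
        System.out.println("diagnostics: " + report.getDiagnostics());
        client.stop();
    }
}
{code}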



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3883) YarnClient.getApplicationReport() doesn't give diagnostics for FINISHED state applications sometimes

2015-07-02 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula reassigned YARN-3883:
--

Assignee: Brahma Reddy Battula

 YarnClient.getApplicationReport() doesn't give diagnostics for 
 FINISHED state applications sometimes 
 --

 Key: YARN-3883
 URL: https://issues.apache.org/jira/browse/YARN-3883
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Devaraj K
Assignee: Brahma Reddy Battula

 YarnClient.getApplicationReport() doesn't give diagnostics for 
 FINISHED state applications sometimes.
 Below is the report from YarnClient.getApplicationReport(). It 
 doesn't show the diagnostics for an application which has FinalStatus as 
 FAILED and YarnApplicationState as FINISHED.
 {code:xml}
 15/07/03 15:53:27 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: XX.XXX.XX.XX
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1435918986890
  final status: FAILED
  tracking URL: 
 http://stobdtserver2:8088/proxy/application_1435848120635_0015/
  user: root
 {code}
 But we can see the Diagnostics information in the RM Web UI for the same 
 application.
 {code:xml}
 YarnApplicationState: FINISHED
 Queue: default
 FinalStatus Reported by AM: FAILED
 Started: Fri Jul 03 15:53:06 +0530 2015
 Elapsed: 20sec
 Tracking URL: History
 Log Aggregation Status: DISABLED
 Diagnostics: User class threw exception: java.lang.NumberFormatException: 
 For input string: xx
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() doesn't give diagnostics for FINISHED state applications sometimes

2015-07-02 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612865#comment-14612865
 ] 

Brahma Reddy Battula commented on YARN-3883:


[~devaraj.k], I would like to work on this. If you have already started 
working on it, feel free to reassign. Thanks!

 YarnClient.getApplicationReport() doesn't give diagnostics for 
 FINISHED state applications sometimes 
 --

 Key: YARN-3883
 URL: https://issues.apache.org/jira/browse/YARN-3883
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Devaraj K
Assignee: Brahma Reddy Battula

 YarnClient.getApplicationReport() doesn't give diagnostics for 
 FINISHED state applications sometimes.
 Below is the report from YarnClient.getApplicationReport(). It 
 doesn't show the diagnostics for an application which has FinalStatus as 
 FAILED and YarnApplicationState as FINISHED.
 {code:xml}
 15/07/03 15:53:27 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: XX.XXX.XX.XX
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1435918986890
  final status: FAILED
  tracking URL: 
 http://stobdtserver2:8088/proxy/application_1435848120635_0015/
  user: root
 {code}
 But we can see the Diagnostics information in the RM Web UI for the same 
 application.
 {code:xml}
 YarnApplicationState: FINISHED
 Queue: default
 FinalStatus Reported by AM: FAILED
 Started: Fri Jul 03 15:53:06 +0530 2015
 Elapsed: 20sec
 Tracking URL: History
 Log Aggregation Status: DISABLED
 Diagnostics: User class threw exception: java.lang.NumberFormatException: 
 For input string: xx
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612648#comment-14612648
 ] 

Junping Du commented on YARN-3815:
--

bq. Also, it would be GREAT if you could give a clear and compelling use case 
(a real life example) on why such support would be crucial. Thanks!
The use case here should be obvious. A quick real-life example is Google 
Borg, a cluster management tool 
(http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf), 
which aggregates per-task resource usage information for usage-based charging, 
job debugging, and long-term capacity planning.

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that queries for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version 
 and flow 
 - User level, expected return: aggregated stats for applications submitted by 
 a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application state is the basic building block for all other aggregation 
 levels. We can provide Flow/User/Queue level aggregated statistics 
 based on application state (a dedicated table for application state is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup

2015-07-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612706#comment-14612706
 ] 

Naganarasimha G R commented on YARN-2923:
-

Thanks [~dian.fu] for the review. But I would also like to get inputs from 
other folks in the open source community on exposing this interface on the RM 
side... Maybe based on that I would move it into {{hadoop-yarn-server-common}}.
[~leftnoteasy], it's been a long time since we revisited the distributed node 
labeling JIRAs; can you please check and review once...


 Support configuration based NodeLabelsProvider Service in Distributed Node 
 Label Configuration Setup 
 -

 Key: YARN-2923
 URL: https://issues.apache.org/jira/browse/YARN-2923
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, 
 YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, 
 YARN-2923.20150517-1.patch


 As part of distributed node label configuration, we need to support node 
 labels being configured in yarn-site.xml. On modification of the node labels 
 configuration in yarn-site.xml, the NM should be able to pick up the modified 
 node labels from this NodeLabelsProvider service without an NM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.

2015-07-02 Thread zhihai xu (JIRA)
zhihai xu created YARN-3882:
---

 Summary: AggregatedLogFormat should close aclScanner and 
ownerScanner after creating them.
 Key: YARN-3882
 URL: https://issues.apache.org/jira/browse/YARN-3882
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


AggregatedLogFormat should close aclScanner and ownerScanner after creating 
them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in 
{{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. 
{{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them 
after using them.
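
As a rough illustration of the proposed fix (a generic Closeable stands in for 
TFile.Reader.Scanner; the names here are hypothetical, not the committed 
code), the scanner would be closed in a finally block so it is released even 
if reading throws:
{code}
import java.io.Closeable;
import java.io.IOException;

// Illustrative sketch only: a generic Closeable scanner stands in for
// TFile.Reader.Scanner, and readAndClose() shows the close-after-use
// pattern the fix proposes.
class ScannerCloseSketch {
  interface Scanner extends Closeable {
    String readValue() throws IOException;
  }

  static String readAndClose(Scanner scanner) throws IOException {
    try {
      return scanner.readValue(); // e.g. read the application owner
    } finally {
      scanner.close();            // always release the scanner
    }
  }
}
{code}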



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-02 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612729#comment-14612729
 ] 

Devaraj K commented on YARN-3878:
-

Thanks [~varun_saxena] for the patch and [~jianhe] for the review. There are a 
few minor comments on the patch; you can address these before getting the 
patch in.

* I think the AsyncDispatcher.isDrained() method can now be removed from 
AsyncDispatcher, and eventQueue.isEmpty() can be verified directly in the tests.
* In TestAsyncDispatcher, can you remove this JIRA number comment and add a 
comment describing what the test does?
{code}
/* Test to verify fix for YARN-3878 */
{code}
* In TestAsyncDispatcher, please use *disp.close()* instead of disp.stop().

 AsyncDispatcher can hang while stopping if it is configured for draining 
 events on stop
 ---

 Key: YARN-3878
 URL: https://issues.apache.org/jira/browse/YARN-3878
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3878.01.patch, YARN-3878.02.patch


 The sequence of events is as under :
 # RM is stopped while putting a RMStateStore Event to RMStateStore's 
 AsyncDispatcher. This leads to an Interrupted Exception being thrown.
 # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
 {{serviceStop}}, we will check if all events have been drained and wait for 
 the event queue to drain (as the RM State Store dispatcher is configured to 
 drain its queue on stop). 
 # This condition never becomes true, and the AsyncDispatcher keeps waiting 
 incessantly for the dispatcher event queue to drain until the JVM exits (see 
 the simplified sketch below).
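
 A simplified, self-contained reduction of the hang (field names follow the 
 AsyncDispatcher pattern, but this is not the actual code):
 {code}
// Simplified reduction of the hang, assuming the drained flag is only
// updated by the event-handling thread: if that thread dies after the
// interrupt, the flag never flips and stop() waits forever.
class DrainOnStopSketch {
  private volatile boolean drained = false;
  private final Object waitForDrained = new Object();

  void stop() throws InterruptedException {
    synchronized (waitForDrained) {
      while (!drained) {          // never true once the handler thread died
        waitForDrained.wait(1000);
      }
    }
  }
}
 {code}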
 *Initial exception while posting RM State store event to queue*
 {noformat}
 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
 (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
 STOPPED
 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
 {noformat}
 *JStack of AsyncDispatcher hanging on stop*
 {noformat}
 AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e 
 waiting on condition [0x7fb9654e9000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000700b79250 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
 at 
 

[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-07-02 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-313:
-
Attachment: YARN-313-v5.patch

Fixed checkstyle (the ones that I could fix and that made sense).
Fixed one unit test (the other one, I have no idea why it breaks; it's in 
refreshNodes).

 Add Admin API for supporting node resource configuration in command line
 

 Key: YARN-313
 URL: https://issues.apache.org/jira/browse/YARN-313
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
 YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch


 We should provide some admin interface, e.g. yarn rmadmin -refreshResources 
 to support changes of node's resource specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() doesn't give diagnostics for FINISHED state applications sometimes

2015-07-02 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612857#comment-14612857
 ] 

Devaraj K commented on YARN-3883:
-

It is occurring for the following reason:

While creating the ApplicationReport as part of 
ClientRMService.getApplicationReport(), YarnApplicationState is set to 
FINISHED in the application report even when the RMAppState is still 
FINISHING. Diagnostics are not yet available while the RMAppState is 
FINISHING; they are set on RMAppImpl during the AppFinishedTransition, when 
the app moves to the FINISHED state.
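
A compilable illustration of that race (local enums stand in for the real 
YARN types; this is not the ClientRMService code):
{code}
// FINISHING already maps to the client-visible FINISHED state, while
// diagnostics are only stored once the app really reaches FINISHED, so a
// report taken in between shows "diagnostics: N/A".
class StateMappingSketch {
  enum RMAppState { FINISHING, FINISHED }
  enum YarnApplicationState { FINISHED }

  static YarnApplicationState toClientState(RMAppState s) {
    switch (s) {
      case FINISHING: // diagnostics not yet set at this point
      case FINISHED:  // diagnostics set during AppFinishedTransition
        return YarnApplicationState.FINISHED;
      default:
        throw new IllegalArgumentException("unexpected state: " + s);
    }
  }
}
{code}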

 YarnClient.getApplicationReport() doesn't give diagnostics for 
 FINISHED state applications sometimes 
 --

 Key: YARN-3883
 URL: https://issues.apache.org/jira/browse/YARN-3883
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Devaraj K

 YarnClient.getApplicationReport() doesn't give diagnostics for 
 FINISHED state applications sometimes 
 Below is the report from YarnClient.getApplicationReport(). It doesn't show 
 the diagnostics for an application whose FinalStatus is FAILED and 
 YarnApplicationState is FINISHED.
 {code:xml}
 15/07/03 15:53:27 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: XX.XXX.XX.XX
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1435918986890
  final status: FAILED
  tracking URL: 
 http://stobdtserver2:8088/proxy/application_1435848120635_0015/
  user: root
 {code}
 But we can see the Diagnostics information in the RM Web UI for the same 
 application.
 {code:xml}
 YarnApplicationState: FINISHED
 Queue:default
 FinalStatus Reported by AM:   FAILED
 Started:  Fri Jul 03 15:53:06 +0530 2015
 Elapsed:  20sec
 Tracking URL: History
 Log Aggregation StatusDISABLED
 Diagnostics:  User class threw exception: java.lang.NumberFormatException: 
 For input string: xx
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-07-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612159#comment-14612159
 ] 

Sunil G commented on YARN-2004:
---

Thank you [~jianhe] for the comments.

- bq.Or this method has more responsibility than that ?
Yes. We are planning to check for ACLs (priority ACLs) in this method. I was 
planning to handle that in a separate ticket.
{noformat}
yarn.scheduler.capacity.root.queue_name.priority.acl=user1,user2
{noformat}
This config will be at the queue level, and we can restrict which users may 
use certain high priorities. So only certain users can use a high priority, 
and others won't be able to submit applications at that priority. This ACL 
check was planned to be added to {{authenticateApplicationPriority}}; a rough 
sketch follows this comment.
- bq.we may merge the two into a single patch ?
I will merge these patches together and will upload them to YARN-2003. 
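
A rough sketch of such a check (the config key is from the comment above; the 
body is illustrative only, not the committed patch):
{code}
import java.util.Set;

// Hypothetical ACL check: users not listed under
// yarn.scheduler.capacity.root.<queue>.priority.acl may only submit at or
// below the unrestricted priority.
class PriorityAclSketch {
  static void authenticateApplicationPriority(String queue, String user,
      int requestedPriority, int maxUnrestrictedPriority,
      Set<String> priorityAclUsers) {
    if (requestedPriority > maxUnrestrictedPriority
        && !priorityAclUsers.contains(user)) {
      throw new IllegalArgumentException("User " + user
          + " may not submit applications at priority " + requestedPriority
          + " to queue " + queue);
    }
  }
}
{code}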

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 
 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 
 0006-YARN-2004.patch, 0007-YARN-2004.patch, 0008-YARN-2004.patch, 
 0009-YARN-2004.patch, 0010-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 ComparatorFiCaSchedulerApp applicationComparator can be changed as below.   
 
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3880) Writing more RM side app-level metrics

2015-07-02 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-3880:
-

 Summary: Writing more RM side app-level metrics
 Key: YARN-3880
 URL: https://issues.apache.org/jira/browse/YARN-3880
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen


In YARN-3044, we implemented an analog of the metrics publisher for ATS v1. 
While it helps to write app/attempt/container life-cycle events, it doesn't 
write many of the app-level system metrics that the RM now has. Listing the 
metrics that I found missing:

* runningContainers
* memorySeconds
* vcoreSeconds
* preemptedResourceMB
* preemptedResourceVCores
* numNonAMContainerPreempted
* numAMContainerPreempted

Please feel free to add more to the list if you find something is not covered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle

2015-07-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612307#comment-14612307
 ] 

Varun Saxena commented on YARN-3047:


Uploaded a new patch. [~sjlee0], [~zjshen], kindly review

 [Data Serving] Set up ATS reader with basic request serving structure and 
 lifecycle
 ---

 Key: YARN-3047
 URL: https://issues.apache.org/jira/browse/YARN-3047
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Varun Saxena
  Labels: BB2015-05-TBR
 Attachments: Timeline_Reader(draft).pdf, 
 YARN-3047-YARN-2928.08.patch, YARN-3047-YARN-2928.09.patch, 
 YARN-3047-YARN-2928.10.patch, YARN-3047.001.patch, YARN-3047.003.patch, 
 YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, 
 YARN-3047.02.patch, YARN-3047.04.patch


 Per design in YARN-2938, set up the ATS reader as a service and implement 
 its basic structure, including lifecycle management, request serving, and so 
 on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle

2015-07-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3047:
---
Attachment: YARN-3047-YARN-2928.10.patch

 [Data Serving] Set up ATS reader with basic request serving structure and 
 lifecycle
 ---

 Key: YARN-3047
 URL: https://issues.apache.org/jira/browse/YARN-3047
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Varun Saxena
  Labels: BB2015-05-TBR
 Attachments: Timeline_Reader(draft).pdf, 
 YARN-3047-YARN-2928.08.patch, YARN-3047-YARN-2928.09.patch, 
 YARN-3047-YARN-2928.10.patch, YARN-3047.001.patch, YARN-3047.003.patch, 
 YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, 
 YARN-3047.02.patch, YARN-3047.04.patch


 Per design in YARN-2938, set up the ATS reader as a service and implement 
 its basic structure, including lifecycle management, request serving, and so 
 on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-07-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612117#comment-14612117
 ] 

Varun Saxena commented on YARN-3051:


Ok...Will make the change

 [Storage abstraction] Create backing storage read interface for ATS readers
 ---

 Key: YARN-3051
 URL: https://issues.apache.org/jira/browse/YARN-3051
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Varun Saxena
 Attachments: YARN-3051-YARN-2928.003.patch, 
 YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, 
 YARN-3051-YARN-2928.05.patch, YARN-3051-YARN-2928.06.patch, 
 YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, 
 YARN-3051.Reader_API_2.patch, YARN-3051.Reader_API_3.patch, 
 YARN-3051.Reader_API_4.patch, YARN-3051.wip.02.YARN-2928.patch, 
 YARN-3051.wip.patch, YARN-3051_temp.patch


 Per design in YARN-2928, create a backing storage read interface that can be 
 implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues

2015-07-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3849:
--
Attachment: 0004-YARN-3849.patch

Yes [~leftnoteasy], you are correct; thanks for pointing that out. I updated 
the patch. :)

 Too much preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical
 Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
 0003-YARN-3849.patch, 0004-YARN-3849.patch


 Two queues are used. Each queue has been given a capacity of 0.5. The 
 Dominant Resource policy is used.
 1. An app is submitted in QueueA, which consumes the full cluster capacity.
 2. After submitting an app in QueueB, there is some demand, invoking 
 preemption in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
 observed that all containers other than the AM get killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption now kicks in for QueueB.
 The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
 apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-07-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612113#comment-14612113
 ] 

Zhijie Shen commented on YARN-3051:
---

2. I meant we store appId, user, flowId, and flowRunId in a CSV file. 
Thoughts?

3. I think FS-impl-related config shouldn't be put in the API, as the impl is 
not supposed to be used publicly, but only for test purposes.

 [Storage abstraction] Create backing storage read interface for ATS readers
 ---

 Key: YARN-3051
 URL: https://issues.apache.org/jira/browse/YARN-3051
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Varun Saxena
 Attachments: YARN-3051-YARN-2928.003.patch, 
 YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, 
 YARN-3051-YARN-2928.05.patch, YARN-3051-YARN-2928.06.patch, 
 YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, 
 YARN-3051.Reader_API_2.patch, YARN-3051.Reader_API_3.patch, 
 YARN-3051.Reader_API_4.patch, YARN-3051.wip.02.YARN-2928.patch, 
 YARN-3051.wip.patch, YARN-3051_temp.patch


 Per design in YARN-2928, create a backing storage read interface that can be 
 implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues

2015-07-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612187#comment-14612187
 ] 

Wangda Tan commented on YARN-3849:
--

[~sunilg],
Thanks for the update, but testPreemptionWithVCoreResource has a similar issue.
{code}
{100:100, 10:100, 0}, // used
{code}
Could you fix it as well?


 Too much preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical
 Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
 0003-YARN-3849.patch


 Two queues are used. Each queue has been given a capacity of 0.5. The 
 Dominant Resource policy is used.
 1. An app is submitted in QueueA, which consumes the full cluster capacity.
 2. After submitting an app in QueueB, there is some demand, invoking 
 preemption in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
 observed that all containers other than the AM get killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption now kicks in for QueueB.
 The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
 apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3881) Writing RM cluster-level metrics

2015-07-02 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-3881:
--
Attachment: metrics.json

 Writing RM cluster-level metrics
 

 Key: YARN-3881
 URL: https://issues.apache.org/jira/browse/YARN-3881
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: metrics.json


 RM has a bunch of metrics that we may want to write into the timeline 
 backend. I attached the metrics.json that I've crawled via 
 {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to 
 three groups of metrics:
 1. QueueMetrics
 2. JvmMetrics
 3. ClusterMetrics
 The problem is that unlike other metrics that belong to a single application, 
 these belong to the RM or are cluster-wide. Therefore, the current write path 
 is not going to work for these metrics because they don't have the associated 
 user/flow/app context info. We need to rethink the modeling of cross-app 
 metrics and the API to handle them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-02 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612320#comment-14612320
 ] 

Jian He commented on YARN-3878:
---

Hi [~varun_saxena], the test does not seem adequate. It doesn't prove the 
AsyncDispatcher will hang in this case. Could you update the test case to 
simulate this scenario so that it actually hangs without the core changes of 
the patch?

 AsyncDispatcher can hang while stopping if it is configured for draining 
 events on stop
 ---

 Key: YARN-3878
 URL: https://issues.apache.org/jira/browse/YARN-3878
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3878.01.patch, YARN-3878.02.patch


 The sequence of events is as under :
 # RM is stopped while putting a RMStateStore Event to RMStateStore's 
 AsyncDispatcher. This leads to an Interrupted Exception being thrown.
 # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
 {{serviceStop}}, we will check if all events have been drained and wait for 
 the event queue to drain (as the RM State Store dispatcher is configured to 
 drain its queue on stop). 
 # This condition never becomes true, and the AsyncDispatcher keeps waiting 
 incessantly for the dispatcher event queue to drain until the JVM exits.
 *Initial exception while posting RM State store event to queue*
 {noformat}
 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
 (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
 STOPPED
 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
 {noformat}
 *JStack of AsyncDispatcher hanging on stop*
 {noformat}
 AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e 
 waiting on condition [0x7fb9654e9000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000700b79250 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
 at 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
 at java.lang.Thread.run(Thread.java:744)
 main prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() 
 [0x7fb989851000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at 

[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-07-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612342#comment-14612342
 ] 

Varun Saxena commented on YARN-3878:


[~jianhe], the test case as such is adequate.
I have basically added two assert statements. If the first statement is true 
and the second is false, a hang will occur.
But as the assert statements are there, the test would fail before the hang 
occurs.
Without the core changes of the patch, the test will fail at the second 
assertion point. 
But if you remove this second assertion point, the hang will occur and the 
test case will time out.
{code}
Assert.assertTrue("Event Queue should have been empty",
    eventQueue.isEmpty());
Assert.assertTrue("Async Dispatcher should have been drained as event "
    + "queue is empty", disp.isDrained());
{code}

So do you want me to remove the second assertion statement so that the test 
case doesn't fail before the hang (without the core changes)?
Let me know.

 AsyncDispatcher can hang while stopping if it is configured for draining 
 events on stop
 ---

 Key: YARN-3878
 URL: https://issues.apache.org/jira/browse/YARN-3878
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3878.01.patch, YARN-3878.02.patch


 The sequence of events is as under :
 # RM is stopped while putting a RMStateStore Event to RMStateStore's 
 AsyncDispatcher. This leads to an Interrupted Exception being thrown.
 # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On 
 {{serviceStop}}, we will check if all events have been drained and wait for 
 the event queue to drain (as the RM State Store dispatcher is configured to 
 drain its queue on stop). 
 # This condition never becomes true, and the AsyncDispatcher keeps waiting 
 incessantly for the dispatcher event queue to drain until the JVM exits.
 *Initial exception while posting RM State store event to queue*
 {noformat}
 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
 (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
 STOPPED
 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
 {noformat}
 *JStack of AsyncDispatcher hanging on stop*
 {noformat}
 AsyncDispatcher event handler prio=10 tid=0x7fb980222800 nid=0x4b1e 
 waiting on condition [0x7fb9654e9000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000700b79250 (a 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at 

[jira] [Created] (YARN-3881) Writing RM cluster-level metrics

2015-07-02 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-3881:
-

 Summary: Writing RM cluster-level metrics
 Key: YARN-3881
 URL: https://issues.apache.org/jira/browse/YARN-3881
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen


RM has a bunch of metrics that we may want to write into the timeline 
backend. I attached the metrics.json that I've crawled via 
{{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to 
three groups of metrics:

1. QueueMetrics
2. JvmMetrics
3. ClusterMetrics

The problem is that unlike other metrics that belong to a single application, 
these belong to the RM or are cluster-wide. Therefore, the current write path 
is not going to work for these metrics because they don't have the associated 
user/flow/app context info. We need to rethink the modeling of cross-app 
metrics and the API to handle them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics

2015-07-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612429#comment-14612429
 ] 

Zhijie Shen commented on YARN-3881:
---

IMHO, we need to add an additional API to directly write the cross-app 
metrics (or already-aggregated metrics, if you think of these as the 
aggregated data of individual apps, such as the counters of 
submitted/pending/running apps) to the backend, in separate tables such as 
cluster/queue/user tables; these data don't need to be aggregated any more.
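
As a hypothetical shape for such an API (names are illustrative only, not a 
committed interface):
{code}
// Pre-aggregated, cross-app metrics are written directly to an
// entity-level table, skipping the per-app aggregation path.
interface AggregatedMetricsWriter {
  enum Level { CLUSTER, QUEUE, USER }

  void write(Level level, String entityId, String metricName,
      long timestamp, long value);
}
{code}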

 Writing RM cluster-level metrics
 

 Key: YARN-3881
 URL: https://issues.apache.org/jira/browse/YARN-3881
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: metrics.json


 RM has a bunch of metrics that we may want to write into the timeline 
 backend. I attached the metrics.json that I've crawled via 
 {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to 
 three groups of metrics:
 1. QueueMetrics
 2. JvmMetrics
 3. ClusterMetrics
 The problem is that unlike other metrics that belong to a single application, 
 these belong to the RM or are cluster-wide. Therefore, the current write path 
 is not going to work for these metrics because they don't have the associated 
 user/flow/app context info. We need to rethink the modeling of cross-app 
 metrics and the API to handle them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics

2015-07-02 Thread Lei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612432#comment-14612432
 ] 

Lei Guo commented on YARN-3881:
---

This is an interesting topic. Assuming the timeline server provides this 
support, should Ambari or other monitoring tools use it for monitoring 
purposes? If not, what's the scenario for writing RM-related metrics?

 Writing RM cluster-level metrics
 

 Key: YARN-3881
 URL: https://issues.apache.org/jira/browse/YARN-3881
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: metrics.json


 RM has a bunch of metrics that we may want to write into the timeline 
 backend. I attached the metrics.json that I've crawled via 
 {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to 
 three groups of metrics:
 1. QueueMetrics
 2. JvmMetrics
 3. ClusterMetrics
 The problem is that unlike other metrics that belong to a single application, 
 these belong to the RM or are cluster-wide. Therefore, the current write path 
 is not going to work for these metrics because they don't have the associated 
 user/flow/app context info. We need to rethink the modeling of cross-app 
 metrics and the API to handle them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-3815:
--
Attachment: hbase-schema-proposal-for-aggregation.pdf
aggregation-design-discussion.pdf

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that queries for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version 
 and flow 
 - User level, expected return: aggregated stats for applications submitted by 
 a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application state is the basic building block for all other aggregation 
 levels. We can provide Flow/User/Queue level aggregated statistics 
 based on application state (a dedicated table for application state is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-07-02 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-313:

Labels:   (was: BB2015-05-TBR)

 Add Admin API for supporting node resource configuration in command line
 

 Key: YARN-313
 URL: https://issues.apache.org/jira/browse/YARN-313
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
 YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch


 We should provide some admin interface, e.g. yarn rmadmin -refreshResources 
 to support changes of node's resource specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line

2015-07-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612553#comment-14612553
 ] 

Junping Du commented on YARN-313:
-

Sorry for coming late on this. [~elgoiri], are you interested in taking on 
this JIRA and moving it forward? If so, I can assign it to you.

 Add Admin API for supporting node resource configuration in command line
 

 Key: YARN-313
 URL: https://issues.apache.org/jira/browse/YARN-313
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-313-sample.patch, YARN-313-v1.patch, 
 YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch


 We should provide some admin interface, e.g. yarn rmadmin -refreshResources 
 to support changes of node's resource specified in a config file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612592#comment-14612592
 ] 

Sangjin Lee commented on YARN-3815:
---

Here is my take on what's consensus, what's not, and what's currently out of 
scope. I may have misread the discussion and your impression/understanding may 
be different, so please feel free to chime in and comment on this!

(consensus or not controversial)
- applications table will be split from the main entities table
- app-level aggregation for framework-specific metrics will be done by the AM
- app-level aggregation for YARN-system container metrics will be done by the 
per-app timeline collector
- real-time aggregation does simple sum for all types of metrics
- metrics API will be updated to differentiate gauges and counters (the type 
information will need to be persisted in the storage)
- for gauges, in addition to the simple sum-based aggregation, support average 
and max (a small sketch of these rules follows this summary)
- the flow-run table will be created to handle app-to-flow-run (real-time) 
aggregation as proposed in the native HBase schema design
- auxiliary tables will be implemented as proposed in the native HBase schema 
design
- time-based aggregation (daily, weekly, monthly, etc.) will be done via 
phoenix tables to enable ad-hoc queries

(questions remaining or undecided)
- for the average/max support for gauges (see above), confirm that's exactly 
what we want to support
- how to implement app-to-flow-run aggregation for gauges
- how to perform the time-based aggregation (mapreduce, using co-processor 
endpoints, etc.)
- how to handle long-running apps for time-based aggregation
- considering adopting null delimiters (or other phoenix-friendly tools) to 
support phoenix reading data from the native HBase tables
- using flow collectors, user collectors, and queue collectors as means of 
performing (higher-level) aggregation

(out of scope)
- support per-container averages for gauges
- any aggregation other than time-based aggregation for flows, users, and queues
- creating a dependency on the explicit YARN flow API
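
As referenced in the gauges item above, a small stand-alone sketch of those 
aggregation rules (plain Java, not the YARN-2928 API):
{code}
// Counters are simply summed; gauges additionally keep a running max and
// enough state (sum and sample count) to report an average.
class GaugeAggregate {
  long sum;    // simple sum, kept for counters and gauges alike
  long max;    // gauges only
  long count;  // number of samples, for the average

  void accept(long value) {
    sum += value;
    count++;
    max = Math.max(max, value);
  }

  double average() {
    return count == 0 ? 0.0 : (double) sum / count;
  }
}
{code}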

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that queries for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version 
 and flow 
 - User level, expected return: aggregated stats for applications submitted by 
 a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application state is the basic building block for all other aggregation 
 levels. We can provide Flow/User/Queue level aggregated statistics 
 based on application state (a dedicated table for application state is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612591#comment-14612591
 ] 

Junping Du commented on YARN-3815:
--

Thanks [~sjlee0] for the nice writeup on the discussions.
Looks good for the most part to me. Some comments on app-level aggregations:

bq. Framework‐specific metrics will be sent to the per‐app collector aggregated 
by the AM itself.
We may consider providing two ways here:
- For legacy applications like MR, the AM has already done aggregation on 
these counters itself.
- For new applications built against YARN after timeline service v2, the AM 
can delegate aggregation to the YARN timeline service instead of doing it 
itself. Our data model and aggregation mechanism should ensure the YARN 
timeline service can aggregate these framework-specific metrics without their 
being predefined.

bq. time average & max: the average multiplied by the elapsed time of the 
application represents the total resource usage over time.
This sounds very clever. In addition, if we need resource consumption at any 
point or over a time window (t1 - t2), we can simply compute Avg(t2) * t2 - 
Avg(t1) * t1. This is much better than aggregating values at each point at 
query time.
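
A small numeric check of that identity (units assumed to be metric-seconds; 
illustrative only):
{code}
// If avg(t) is the running time-average of a gauge since application
// start, total usage over a window (t1, t2) is Avg(t2)*t2 - Avg(t1)*t1.
class WindowUsageSketch {
  static double usageOverWindow(double avgAtT1, double t1,
      double avgAtT2, double t2) {
    return avgAtT2 * t2 - avgAtT1 * t1;
  }

  public static void main(String[] args) {
    // avg 2.0 GB over the first 100s, avg 2.5 GB over the first 200s
    // => usage in (100, 200) = 2.5*200 - 2.0*100 = 300 GB-seconds
    System.out.println(usageOverWindow(2.0, 100, 2.5, 200));
  }
}
{code}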


 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that queries for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version 
 and flow 
 - User level, expected return: aggregated stats for applications submitted by 
 a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application state is the basic building block for all other aggregation 
 levels. We can provide Flow/User/Queue level aggregated statistics 
 based on application state (a dedicated table for application state is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612529#comment-14612529
 ] 

Sangjin Lee commented on YARN-3815:
---

Some of us ([~gtCarrera9], [~vinodkv], [~djp], [~zjshen], [~vrushalic], and 
[~sjlee0]) had a face-to-face design discussion on the aggregation. I am going 
to post the summary of that discussion along with a proposal for an expanded 
native HBase schema to support aggregation.

I believe we are much closer to a consensus on the aggregation design, but some 
important questions still remain. For the sake of public discussion and 
inviting more participants and comments, we should follow up here on this JIRA.

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that queries for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version 
 and flow 
 - User level, expected return: aggregated stats for applications submitted by 
 a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application state is the basic building block for all other aggregation 
 levels. We can provide Flow/User/Queue level aggregated statistics 
 based on application state (a dedicated table for application state is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3445) Cache runningApps in RMNode for getting running apps on given NodeId

2015-07-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612543#comment-14612543
 ] 

Junping Du commented on YARN-3445:
--

Thanks for the review and comments, [~mingma]!
bq. That is around 10M entries. So it should be ok for RM.
ApplicationId only contains an int (4 bytes) and a long (8 bytes) field. Even 
considering the Java object header, padding, and PB object overhead, it should 
be far less than 100 bytes. Agreed that it should be fine even at the large 
scale of the mentioned scenario.

bq. Do you need synchronizedList in the following list? It looks like the 
access of runningApplications are protected by RMNodeImpl's readLock and 
writeLock.
Nice catch! Will replace the synchronizedList with an ArrayList and add some 
writeLocks (missing in the previous patch).
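
A minimal sketch of that change (field names are illustrative, not the actual 
RMNodeImpl code):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A plain ArrayList guarded by the node's read/write lock, instead of
// Collections.synchronizedList.
class RunningAppsSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> runningApplications = new ArrayList<String>();

  void addApp(String appId) {
    lock.writeLock().lock();   // mutation requires the write lock
    try {
      runningApplications.add(appId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  List<String> getRunningApps() {
    lock.readLock().lock();    // reads take the cheaper read lock
    try {
      return new ArrayList<String>(runningApplications); // defensive copy
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}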

 Cache runningApps in RMNode for getting running apps on given NodeId
 

 Key: YARN-3445
 URL: https://issues.apache.org/jira/browse/YARN-3445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.0
Reporter: Junping Du
Assignee: Junping Du
 Attachments: YARN-3445-v2.patch, YARN-3445-v3.1.patch, 
 YARN-3445-v3.patch, YARN-3445.patch


 Per discussion in YARN-3334, we need to filter out unnecessary collector 
 info from the RM in the heartbeat response. Our proposal is to add a cache 
 for runningApps in RMNode, so the RM only sends back collectors for locally 
 running apps. This is also needed in YARN-914 (graceful decommission): if 
 there are no running apps on an NM in the decommissioning stage, it will get 
 decommissioned immediately. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612620#comment-14612620
 ] 

Junping Du commented on YARN-3815:
--

bq. app-level aggregation for framework-specific metrics will be done by the 
AM.
I think there is a little misunderstanding on this - just as I mentioned 
above, the AM should/could be relieved from aggregating counters itself after 
timeline service v2. Legacy AMs could still push aggregated counters to the 
backend storage, though. Others who were also in the room, any comments here? 

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that queries for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version 
 and flow 
 - User level, expected return: aggregated stats for applications submitted by 
 a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application state is the basic building block for all other aggregation 
 levels. We can provide Flow/User/Queue level aggregated statistics 
 based on application state (a dedicated table for application state is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-07-02 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612629#comment-14612629
 ] 

Sangjin Lee commented on YARN-3815:
---

For gauges and their averages and max in particular, [~vinodkv], 
[~gtCarrera9], [~djp], could you please confirm that what I captured in that 
document is exactly what we want to support, and comment on that?

Also, it would be *GREAT* if you could give a clear and compelling use case (a 
real life example) on why such support would be crucial. Thanks!

 [Aggregation] Application/Flow/User/Queue Level Aggregations
 

 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: Timeline Service Nextgen Flow, User, Queue Level 
 Aggregations (v1).pdf, aggregation-design-discussion.pdf, 
 hbase-schema-proposal-for-aggregation.pdf


 Per previous discussions in some design documents for YARN-2928, the basic 
 scenario is that queries for stats can happen at:
 - Application level, expected return: an application with aggregated stats
 - Flow level, expected return: aggregated stats for a flow_run, flow_version 
 and flow 
 - User level, expected return: aggregated stats for applications submitted by 
 a user
 - Queue level, expected return: aggregated stats for applications within the 
 queue
 Application state is the basic building block for all other aggregation 
 levels. We can provide Flow/User/Queue level aggregated statistics 
 based on application state (a dedicated table for application state is 
 needed, which is missing from previous design documents like the 
 HBase/Phoenix schema design). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3866) AM-RM protocol changes to support container resizing

2015-07-02 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-3866:

Attachment: YARN-3866.2.patch

Attached a new patch based on the review comments.

1) Moved most of the PB records test cases to {{TestPBImplRecords}} and 
deleted unnecessary test files.
2) Fixed JavaDoc annotations and improved/added comments on all public 
methods. Checked the generated Java docs to make sure they look OK.
3) Fixed the indentation problem.
4) Added {{ContainerStatus}}, {{ContainerStatusPBImpl}}

I will fix the JavaDoc and indentation issues in the other tickets as well.

Thanks [~leftnoteasy] for the review.


 AM-RM protocol changes to support container resizing
 

 Key: YARN-3866
 URL: https://issues.apache.org/jira/browse/YARN-3866
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: MENG DING
Assignee: MENG DING
 Attachments: YARN-3866.1.patch, YARN-3866.2.patch


 YARN-1447 and YARN-1448 are outdated. 
 This ticket deals with AM-RM protocol changes to support container resizing 
 according to the latest design in YARN-1197.
 1) Add increase/decrease requests in AllocateRequest
 2) Get approved increase/decrease requests from RM in AllocateResponse
 3) Add relevant test cases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)