[jira] [Commented] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
[ https://issues.apache.org/jira/browse/YARN-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612909#comment-14612909 ] Xuan Gong commented on YARN-3882: - +1 LGTM. Pending Jenkins > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. > --- > > Key: YARN-3882 > URL: https://issues.apache.org/jira/browse/YARN-3882 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-3882.000.patch > > > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in > {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. > {{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them > after using them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
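For illustration, a minimal sketch of the kind of fix the description calls for (the loop details are assumptions modeled on AggregatedLogFormat.LogReader, not necessarily the attached patch): the scanner is closed in a finally block on every exit path, since {{TFile.Reader.Scanner}} implements {{java.io.Closeable}}.

{code:java}
// Sketch only; assumes a LogReader-style "reader" field and the
// APPLICATION_OWNER LogKey constant from AggregatedLogFormat.
TFile.Reader.Scanner ownerScanner = reader.createScanner();
try {
  LogKey key = new LogKey();
  while (!ownerScanner.atEnd()) {
    TFile.Reader.Scanner.Entry entry = ownerScanner.entry();
    key.readFields(entry.getKeyStream());
    if (key.toString().equals(APPLICATION_OWNER.toString())) {
      return entry.getValueStream().readUTF();
    }
    ownerScanner.advance();
  }
  return null;
} finally {
  ownerScanner.close(); // released even if readFields/readUTF throws
}
{code}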
[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612872#comment-14612872 ] Devaraj K commented on YARN-3883: - Please go ahead [~brahmareddy]. > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K >Assignee: Brahma Reddy Battula > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612865#comment-14612865 ] Brahma Reddy Battula commented on YARN-3883: [~devaraj.k], I would like to work on this. If you have already started working on it, feel free to reassign. Thanks. > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K >Assignee: Brahma Reddy Battula > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-3883: -- Assignee: Brahma Reddy Battula > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K >Assignee: Brahma Reddy Battula > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
[ https://issues.apache.org/jira/browse/YARN-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612857#comment-14612857 ] Devaraj K commented on YARN-3883: - It is occurring for this reason: while creating the ApplicationReport as part of ClientRMService.getApplicationReport(), the report's YarnApplicationState is set to FINISHED even when the RMAppState is still FINISHING. Diagnostics is not yet available while the RMAppState is FINISHING; it is only set on RMAppImpl during the AppFinishedTransition, when the app moves to the FINISHED state. > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state > -- > > Key: YARN-3883 > URL: https://issues.apache.org/jira/browse/YARN-3883 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Devaraj K > > YarnClient.getApplicationReport() sometimes does not give diagnostics for > applications in the FINISHED state. > Below is the report from YarnClient.getApplicationReport(); it doesn't show > the diagnostics for an application whose FinalStatus is FAILED and > YarnApplicationState is FINISHED. > {code:xml} > 15/07/03 15:53:27 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: XX.XXX.XX.XX > ApplicationMaster RPC port: 0 > queue: default > start time: 1435918986890 > final status: FAILED > tracking URL: > http://stobdtserver2:8088/proxy/application_1435848120635_0015/ > user: root > {code} > But we can see the Diagnostics information in the RM Web UI for the same > application. > {code:xml} > YarnApplicationState: FINISHED > Queue: default > FinalStatus Reported by AM: FAILED > Started: Fri Jul 03 15:53:06 +0530 2015 > Elapsed: 20sec > Tracking URL: History > Log Aggregation Status: DISABLED > Diagnostics: User class threw exception: java.lang.NumberFormatException: > For input string: "xx" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
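For context, a simplified sketch of the state mapping being described (in Hadoop 2.7 the real mapping lives in RMServerUtils.createApplicationState; other states are elided here, so this is not the exact source):

{code:java}
// Simplified sketch; intentional fall-through from FINISHING to FINISHED.
public static YarnApplicationState createApplicationState(RMAppState rmAppState) {
  switch (rmAppState) {
    case FINISHING:
      // FINISHING is internal-only and is already reported to clients as
      // FINISHED, but RMAppImpl only sets diagnostics later, in
      // AppFinishedTransition. A report taken in this window therefore
      // shows FINISHED with "N/A" diagnostics.
    case FINISHED:
      return YarnApplicationState.FINISHED;
    default:
      throw new YarnRuntimeException("Unknown state passed: " + rmAppState);
  }
}
{code}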
[jira] [Created] (YARN-3883) YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state
Devaraj K created YARN-3883: --- Summary: YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state Key: YARN-3883 URL: https://issues.apache.org/jira/browse/YARN-3883 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Devaraj K YarnClient.getApplicationReport() sometimes does not give diagnostics for applications in the FINISHED state. Below is the report from YarnClient.getApplicationReport(); it doesn't show the diagnostics for an application whose FinalStatus is FAILED and YarnApplicationState is FINISHED. {code:xml} 15/07/03 15:53:27 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: XX.XXX.XX.XX ApplicationMaster RPC port: 0 queue: default start time: 1435918986890 final status: FAILED tracking URL: http://stobdtserver2:8088/proxy/application_1435848120635_0015/ user: root {code} But we can see the Diagnostics information in the RM Web UI for the same application. {code:xml} YarnApplicationState: FINISHED Queue: default FinalStatus Reported by AM: FAILED Started: Fri Jul 03 15:53:06 +0530 2015 Elapsed: 20sec Tracking URL: History Log Aggregation Status: DISABLED Diagnostics: User class threw exception: java.lang.NumberFormatException: For input string: "xx" {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612805#comment-14612805 ] Dian Fu commented on YARN-2923: --- {quote}But I would also like to get inputs from other folks in the open source community on exposing this interface on the RM side... maybe based on that I would move it into hadoop-yarn-server-common.{quote} Yes, of course. > Support configuration based NodeLabelsProvider Service in Distributed Node > Label Configuration Setup > - > > Key: YARN-2923 > URL: https://issues.apache.org/jira/browse/YARN-2923 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Fix For: 2.8.0 > > Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, > YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, > YARN-2923.20150517-1.patch > > > As part of distributed node label configuration, we need to support node > labels being configured in yarn-site.xml. On modification of the node label > configuration in yarn-site.xml, the NM should be able to get the modified node > labels from this NodeLabelsProvider service without an NM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612798#comment-14612798 ] Sangjin Lee commented on YARN-3815: --- {quote} I think we don't have to do it at the container level, but it is also not necessary for the AM to retain and aggregate these values. The AM could help forward the values to the per-app timeline collector but doesn't have to aggregate them. Vinod has more ideas on this from the offline discussion. [~vinodkv], can you comment on this? {quote} Interesting. Could you or [~vinodkv] shed light on the idea? It would still need to be captured in an entity or entities, right? I would think sending it as part of the container entities would be simpler and more consistent (in that the per-app collector can simply look at all container metrics as subject to aggregation). I'd love to hear more about this. {quote} I think "per-container averages" are not the same as per-container resource usage. Understanding an application's real resource consumption/usage has been one of the core use cases for the new timeline service from the beginning, so I don't think we should rule out anything important here. {quote} How is the per-container resource usage different from the per-container average described in the summary? Could you kindly provide its definition? No doubt understanding applications' real resource consumption/usage is critical. Between the individual container resource usage (which is all captured), the aggregated resource usage at the app/flow level (which the basic real-time aggregation addresses), and the running averages/max of the aggregated resource usage at the app/flow level, I think that need is definitely covered. What would be the gap that's not addressed by the above data? > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-313: - Attachment: YARN-313-v5.patch Fixed the checkstyle issues (the ones that I could fix and that made sense). Fixed one unit test (the other one, in refreshNodes, I have no idea why it breaks). > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch, YARN-313-v5.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612729#comment-14612729 ] Devaraj K commented on YARN-3878: - Thanks [~varun_saxena] for the patch and [~jianhe] for the review. There are a few minor comments on the patch; you can address these before getting the patch in. * I think the AsyncDispatcher.isDrained() method can be removed from AsyncDispatcher now, and eventQueue.isEmpty() can be verified directly in the tests. * In TestAsyncDispatcher, can you remove this JIRA number comment and add a comment describing what the test does? {code:xml} /* Test to verify fix for YARN-3878 */ {code} * In TestAsyncDispatcher, please use *disp.close()* instead of disp.stop(). > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as follows: > # The RM is stopped while putting an RMStateStore event on RMStateStore's > AsyncDispatcher. This leads to an InterruptedException being thrown. > # As the RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for the > event queue to drain (as the RM state store dispatcher is configured to drain > the queue on stop). > # This condition never becomes true and the AsyncDispatcher keeps waiting > incessantly for the dispatcher event queue to drain until the JVM exits. > *Initial exception while posting an RM state store event to the queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.jav
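For illustration, a rough sketch of the shape those review comments suggest for the test (hypothetical code, not the actual TestAsyncDispatcher): a descriptive name and comment instead of the JIRA number, {{close()}} instead of {{stop()}}, and the event queue checked directly rather than via {{isDrained()}}.

{code:java}
// Hypothetical sketch only; the real test lives in TestAsyncDispatcher.
@Test(timeout = 10000)
public void testDispatcherDrainsEventQueueOnStop() throws Exception {
  // Verifies the dispatcher does not hang on stop when configured to
  // drain events, even if the handler thread was interrupted.
  BlockingQueue<Event> eventQueue = new LinkedBlockingQueue<Event>();
  AsyncDispatcher disp = new AsyncDispatcher(eventQueue);
  disp.init(new Configuration());
  disp.setDrainEventsOnStop();
  disp.start();
  // ... enqueue events and interrupt the handler as in the bug scenario ...
  disp.close();                             // close() rather than stop()
  Assert.assertTrue(eventQueue.isEmpty());  // check the queue directly
}
{code}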
[jira] [Updated] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
[ https://issues.apache.org/jira/browse/YARN-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3882: Attachment: YARN-3882.000.patch > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. > --- > > Key: YARN-3882 > URL: https://issues.apache.org/jira/browse/YARN-3882 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-3882.000.patch > > > AggregatedLogFormat should close aclScanner and ownerScanner after creating > them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in > {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. > {{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them > after using them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
zhihai xu created YARN-3882: --- Summary: AggregatedLogFormat should close aclScanner and ownerScanner after creating them. Key: YARN-3882 URL: https://issues.apache.org/jira/browse/YARN-3882 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor AggregatedLogFormat should close aclScanner and ownerScanner after creating them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. {{TFile.Reader.Scanner}} implements java.io.Closeable. We should close them after using them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612706#comment-14612706 ] Naganarasimha G R commented on YARN-2923: - Thanks [~dian.fu] for the review. But I would also like to get inputs from other folks in the open source community on exposing this interface on the RM side... maybe based on that I would move it into {{hadoop-yarn-server-common}}. [~leftnoteasy], it's been a long time since we revisited the distributed node labeling JIRAs; can you please check and review once more ... > Support configuration based NodeLabelsProvider Service in Distributed Node > Label Configuration Setup > - > > Key: YARN-2923 > URL: https://issues.apache.org/jira/browse/YARN-2923 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Fix For: 2.8.0 > > Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, > YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, > YARN-2923.20150517-1.patch > > > As part of distributed node label configuration, we need to support node > labels being configured in yarn-site.xml. On modification of the node label > configuration in yarn-site.xml, the NM should be able to get the modified node > labels from this NodeLabelsProvider service without an NM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612702#comment-14612702 ] Inigo Goiri commented on YARN-313: -- Up to you, [~djp]; you did the work. I'm just trying to keep it up to date with trunk. > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612687#comment-14612687 ] Junping Du commented on YARN-3815: -- Thanks [~sjlee0] for the comments! bq. I think it is pretty natural and straightforward for AMs to aggregate and retain values at the app level, but even if they set it at the container level, it could work. I would rather say it was "natural" before timeline service v2 came along. :) I think we don't have to do it at the container level, but it is also not necessary for the AM to retain and aggregate these values. The AM could help forward the values to the per-app timeline collector but doesn't have to aggregate them. Vinod has more ideas on this from the offline discussion. [~vinodkv], can you comment on this? bq. Note that we're not proposing to keep the average as a time series. So I'm not sure if that is feasible. If not, we may consider changing the proposal to support a time series, given the data volume is not too large here. bq. We also ruled out per-container averages (explained in the summary), so per-task resource usage is not an example we're looking for. I think "per-container averages" are not the same as per-container resource usage. Understanding an application's real resource consumption/usage has been one of the core use cases for the new timeline service from the beginning, so I don't think we should rule out anything important here. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612684#comment-14612684 ] Sangjin Lee commented on YARN-3815: --- {quote} The use case here should be obvious. A quick real-life example is Google's Borg cluster management tool (http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf), which aggregates per-task resource usage information for usage-based charging, job debugging, and long-term capacity planning. {quote} Thanks [~djp]. What I'm looking for are slightly more specific examples. That's why we spent some time during the discussion to define precisely what we mean by "averages". We discovered that there were already two different definitions of the average for gauges. We also ruled out per-container averages (explained in the summary), so per-task resource usage is not an example we're looking for. So as for the moving (but aggregate) average, are there other examples? What we discussed during the meeting (also in the summary) was the total CPU utilization of an app/flow. Are there other examples, and how might they be useful, or is that pretty much the best example? > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612680#comment-14612680 ] Sangjin Lee commented on YARN-3815: --- bq. This approach sounds very clever. In addition, if we need resource consumption at any point or over a time window (t1, t2), we can simply compute Avg(t2) * t2 - Avg(t1) * t1. This is much better than aggregating values at each point at query time. Note that we're not proposing to keep the average as a *time series*. So I'm not sure if that is feasible. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612678#comment-14612678 ] Sangjin Lee commented on YARN-3815: --- {quote} We may consider providing two ways here: - For legacy applications like MR, the AM has already done aggregation on these counters itself. - For new applications built against YARN after timeline service v2, the AM can delegate aggregation to the YARN timeline service instead of doing it itself. Our data model and aggregation mechanism should ensure the YARN timeline service can aggregate these framework-specific metrics without them being predefined. {quote} I think it's a little more complicated than that. If a new YARN application wants to delegate aggregation to the YARN timeline service, it still needs to do at least the following: - add the framework-specific metrics to the YARN container - do *not* add any of those metrics to the YARN application The framework-specific metrics set on the containers would still be transmitted by the AM (not by the node managers). Then, the YARN timeline service could look at *any* container metrics and apply the uniform aggregation rules. Hopefully YARN apps can add metric values to container entities (there should be a natural mapping from units of work to containers); otherwise it won't work for them... I think it is pretty natural and straightforward for AMs to aggregate and retain values at the app level, but even if they set it at the container level, it could work. On the other hand, if your app wants to own aggregation, then it should not set the metrics on the containers, or it would be done twice. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
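For illustration, a hypothetical sketch of the pattern described above, using the v2 entity model under discussion (the exact class and method names here are assumptions, not the branch's settled API): the AM attaches the framework metric to the container entity and leaves the application entity alone, so the per-app collector can aggregate all container metrics uniformly.

{code:java}
// Hypothetical sketch; names are assumptions based on the v2 design.
TimelineEntity containerEntity = new TimelineEntity();
containerEntity.setType("YARN_CONTAINER");
containerEntity.setId(containerId.toString());

TimelineMetric mapOutputRecords = new TimelineMetric();
mapOutputRecords.setId("MAP_OUTPUT_RECORDS");
mapOutputRecords.addValue(System.currentTimeMillis(), 48231L);
containerEntity.addMetric(mapOutputRecords);

// Sent by the AM (not the NM) to the per-app collector; no copy of the
// metric is put on the application entity, to avoid double aggregation.
timelineClient.putEntities(containerEntity);
{code}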
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612651#comment-14612651 ] Anubhav Dhoot commented on YARN-433: LGTM > When RM is catching up with node updates then it should not expire acquired > containers > -- > > Key: YARN-433 > URL: https://issues.apache.org/jira/browse/YARN-433 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Xuan Gong > Attachments: YARN-433.1.patch, YARN-433.2.patch > > > The RM expires containers that are not launched within some time of being > allocated. The default is 10 minutes. When the RM is not keeping up with node > updates, it may not be aware of newly launched containers. If the expiry > thread fires for such containers, the RM can expire them even though they > may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612648#comment-14612648 ] Junping Du commented on YARN-3815: -- bq. Also, it would be GREAT if you could give a clear and compelling use case (a real-life example) showing why such support would be crucial. Thanks! The use case here should be obvious. A quick real-life example is Google's Borg cluster management tool (http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf), which aggregates per-task resource usage information for usage-based charging, job debugging, and long-term capacity planning. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3866) AM-RM protocol changes to support container resizing
[ https://issues.apache.org/jira/browse/YARN-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-3866: Attachment: YARN-3866.2.patch Attached a new patch based on the review comments. 1) Moved most of the PB record test cases to {{TestPBImplRecords}} and deleted unnecessary test files. 2) Fixed the JavaDoc annotations and improved/added comments on all public methods; checked the generated Javadocs to make sure they look OK. 3) Fixed the indentation problem. 4) Added {{ContainerStatus}}, {{ContainerStatusPBImpl}} I will fix JavaDoc and indentation issues in the other tickets as well. Thanks [~leftnoteasy] for the review. > AM-RM protocol changes to support container resizing > > > Key: YARN-3866 > URL: https://issues.apache.org/jira/browse/YARN-3866 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-3866.1.patch, YARN-3866.2.patch > > > YARN-1447 and YARN-1448 are outdated. > This ticket deals with AM-RM protocol changes to support container resizing > according to the latest design in YARN-1197. > 1) Add increase/decrease requests to AllocateRequest > 2) Get approved increase/decrease requests from the RM in AllocateResponse > 3) Add relevant test cases -- This message was sent by Atlassian JIRA (v6.3.4#6332)
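For illustration, roughly the record shape the ticket describes (names here are assumptions based on the YARN-1197 design discussion, not necessarily the attached patch): the AM asks for a resize in AllocateRequest and reads approved changes back from AllocateResponse on a later heartbeat.

{code:java}
// Hypothetical sketch; record/method names are assumptions.
ContainerResourceChangeRequest increase =
    ContainerResourceChangeRequest.newInstance(
        containerId, Resource.newInstance(4096 /* MB */, 2 /* vcores */));
allocateRequest.setIncreaseRequests(Collections.singletonList(increase));

AllocateResponse response = amRMProtocol.allocate(allocateRequest);
for (Container increased : response.getIncreasedContainers()) {
  // Approved increases come back here on a subsequent heartbeat;
  // approved decreases would arrive via getDecreasedContainers().
}
{code}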
[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612638#comment-14612638 ] Zhijie Shen commented on YARN-3881: --- Once the metrics are ready, we can build a built-in YARN/timeline service web UI to show this information, as well as expose it via an API, so that third-party monitoring tools like Ambari can integrate with it. I think it should be quite flexible. > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > The RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single > application, these belong to the RM or are cluster-wide. Therefore, the > current write path is not going to work for these metrics because they don't > have the associated user/flow/app context info. We need to rethink modeling > cross-app metrics and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612629#comment-14612629 ] Sangjin Lee commented on YARN-3815: --- For gauges and their averages and max in particular, [~vinodkv], [~gtCarrera9], [~djp], could you please confirm that what I captured in that document is exactly what we want to support, and comment on it? Also, it would be *GREAT* if you could give a clear and compelling use case (a real-life example) showing why such support would be crucial. Thanks! > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612620#comment-14612620 ] Junping Du commented on YARN-3815: -- bq. app-level aggregation for framework-specific metrics will be done by the AM. I think there is a little misunderstanding here: as I mentioned above, AMs should/could be relieved from aggregating counters themselves after timeline service v2. Legacy AMs could still push aggregated counters to the backend storage, though. Others who were also in the room, any comments here? > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612592#comment-14612592 ] Sangjin Lee commented on YARN-3815: --- Here is my take on what's consensus, what's not, and what's currently out of scope. I may have misread the discussion and your impression/understanding may be different, so please feel free to chime in and comment on this! (consensus or not controversial) - applications table will be split from the main entities table - app-level aggregation for framework-specific metrics will be done by the AM - app-level aggregation for YARN-system container metrics will be done by the per-app timeline collector - real-time aggregation does a simple sum for all types of metrics - metrics API will be updated to differentiate gauges and counters (the type information will need to be persisted in the storage) - for gauges, in addition to the simple sum-based aggregation, support average and max - the flow-run table will be created to handle app-to-flow-run ("real-time") aggregation as proposed in the native HBase schema design - auxiliary tables will be implemented as proposed in the native HBase schema design - time-based aggregation (daily, weekly, monthly, etc.) will be done via Phoenix tables to enable ad-hoc queries (questions remaining or undecided) - for the average/max support for gauges (see above), confirm that's exactly what we want to support - how to implement app-to-flow-run aggregation for gauges - how to perform the time-based aggregation (mapreduce, using co-processor endpoints, etc.) - how to handle long-running apps for time-based aggregation - consider adopting "null delimiters" (or other Phoenix-friendly tools) to support Phoenix reading data from the native HBase tables - using flow collectors, user collectors, and queue collectors as means of performing (higher-level) aggregation (out of scope) - support per-container averages for gauges - any aggregation other than time-based aggregation for flows, users, and queues - creating a dependency on the explicit YARN flow API > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612591#comment-14612591 ] Junping Du commented on YARN-3815: -- Thanks [~sjlee0] for the nice writeup of the discussions. Most parts look good to me. Some comments on app-level aggregations: bq. Framework‐specific metrics will be sent to the per‐app collector aggregated by the AM itself. We may consider providing two ways here: - For legacy applications like MR, the AM has already done aggregation on these counters itself. - For new applications built against YARN after timeline service v2, the AM can delegate aggregation to the YARN timeline service instead of doing it itself. Our data model and aggregation mechanism should ensure the YARN timeline service can aggregate these framework-specific metrics without them being predefined. bq. time average & max: the average multiplied by the elapsed time of the application represents the total resource usage over time. This approach sounds very clever. In addition, if we need resource consumption at any point or over a time window (t1, t2), we can simply compute Avg(t2) * t2 - Avg(t1) * t1. This is much better than aggregating values at each point at query time. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
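Spelled out, the identity being relied on above (assuming Avg(t) denotes the running time average of the metric measured from application start):

{noformat}
C(t)      = Avg(t) * t                      cumulative resource usage from app start to time t
U(t1, t2) = C(t2) - C(t1)
          = Avg(t2) * t2 - Avg(t1) * t1     usage within the window (t1, t2]
{noformat}

So storing only the running average (with timestamps) is enough to recover the total over any window whose endpoints were observed, without re-aggregating individual data points at query time.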
[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612553#comment-14612553 ] Junping Du commented on YARN-313: - Sorry for coming late on this. [~elgoiri], are you interested in taking over this JIRA and moving it forward? If so, I can assign it to you. > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-313: Labels: (was: BB2015-05-TBR) > Add Admin API for supporting node resource configuration in command line > > > Key: YARN-313 > URL: https://issues.apache.org/jira/browse/YARN-313 > Project: Hadoop YARN > Issue Type: Sub-task > Components: client >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-313-sample.patch, YARN-313-v1.patch, > YARN-313-v2.patch, YARN-313-v3.patch, YARN-313-v4.patch > > > We should provide an admin interface, e.g. "yarn rmadmin -refreshResources", > to support changes to a node's resources specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3445) Cache runningApps in RMNode for getting running apps on given NodeId
[ https://issues.apache.org/jira/browse/YARN-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612543#comment-14612543 ] Junping Du commented on YARN-3445: -- Thanks for the review and comments, [~mingma]! bq. That is around 10M entries. So it should be ok for RM. ApplicationId only contains an int (4 bytes) and a long (8 bytes) field. Even considering the Java object header, padding, and PB object overhead, each entry should be far less than 100 bytes, so 10M entries would stay under roughly 1 GB. I agree it should be fine even at the large scale in the mentioned scenario. bq. Do you need synchronizedList in the following list? It looks like the access of runningApplications is protected by RMNodeImpl's readLock and writeLock. Nice catch! I will replace the synchronizedList with an ArrayList and add some write locks (missing in the previous patch). > Cache runningApps in RMNode for getting running apps on given NodeId > > > Key: YARN-3445 > URL: https://issues.apache.org/jira/browse/YARN-3445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Affects Versions: 2.7.0 >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-3445-v2.patch, YARN-3445-v3.1.patch, > YARN-3445-v3.patch, YARN-3445.patch > > > Per discussion in YARN-3334, we need to filter out unnecessary collector info > from the RM in the heartbeat response. Our proposal is to add a cache of > runningApps in RMNode, so the RM only sends back collectors for locally > running apps. This is also needed in YARN-914 (graceful decommission): if > there are no running apps on an NM that is in the decommissioning stage, it > will get decommissioned immediately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
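For illustration, a hypothetical sketch of the change being discussed (method names are invented here; only the readLock/writeLock convention follows RMNodeImpl, and this is not the actual patch):

{code:java}
// Plain ArrayList instead of Collections.synchronizedList; all access is
// guarded by RMNodeImpl's existing readLock/writeLock.
private final List<ApplicationId> runningApplications =
    new ArrayList<ApplicationId>();

public List<ApplicationId> getRunningApps() {
  this.readLock.lock();
  try {
    // Defensive copy so callers never see a partially updated list.
    return new ArrayList<ApplicationId>(this.runningApplications);
  } finally {
    this.readLock.unlock();
  }
}

private void addRunningApp(ApplicationId appId) {
  this.writeLock.lock();
  try {
    this.runningApplications.add(appId);
  } finally {
    this.writeLock.unlock();
  }
}
{code}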
[jira] [Updated] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3815: -- Attachment: hbase-schema-proposal-for-aggregation.pdf aggregation-design-discussion.pdf > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf, aggregation-design-discussion.pdf, > hbase-schema-proposal-for-aggregation.pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
[ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612529#comment-14612529 ] Sangjin Lee commented on YARN-3815: --- Some of us ([~gtCarrera9], [~vinodkv], [~djp], [~zjshen], [~vrushalic], and [~sjlee0]) had a face-to-face design discussion on aggregation. I am going to post the summary of that discussion along with a proposal for an expanded native HBase schema to support aggregation. I believe we are much closer to a consensus on the aggregation design, but some important questions still remain. For the sake of public discussion and inviting more participants and comments, we should follow up here on this JIRA. > [Aggregation] Application/Flow/User/Queue Level Aggregations > > > Key: YARN-3815 > URL: https://issues.apache.org/jira/browse/YARN-3815 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: Timeline Service Nextgen Flow, User, Queue Level > Aggregations (v1).pdf > > > Per previous discussions in some design documents for YARN-2928, the basic > scenario is that the query for stats can happen at: > - Application level, expected return: an application with aggregated stats > - Flow level, expected return: aggregated stats for a flow_run, flow_version, > and flow > - User level, expected return: aggregated stats for applications submitted by > a user > - Queue level, expected return: aggregated stats for applications within the > queue > Application states are the basic building block for all other aggregation > levels. We can provide Flow/User/Queue level aggregated statistics info > based on application states (a dedicated table for application states is > needed, which is missing from previous design documents such as the > HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612511#comment-14612511 ] Jian He commented on YARN-3878: --- ah, sorry, I overlooked. lgtm, thanks ! > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher 
hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) >
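The hang described in this issue reduces to a drained flag that is never set once the handler thread dies on an interrupt while the queue is still non-empty. A stripped-down sketch of the problematic pattern (illustrative names, not the actual AsyncDispatcher code):
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Minimal sketch of the drain-on-stop hang; not the real dispatcher. */
public class MiniDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;
  private volatile boolean drained = false;

  private final Thread handler = new Thread(() -> {
    while (!stopped) {
      drained = queue.isEmpty();
      try {
        queue.take().run();
      } catch (InterruptedException e) {
        return; // handler dies here; 'drained' may stay false forever
      }
    }
  });

  public void start() { handler.start(); }

  public synchronized void stop() throws InterruptedException {
    // If the handler already exited on interrupt with events still queued,
    // 'drained' never becomes true and this wait loop spins until JVM exit.
    while (!drained) {
      wait(100);
    }
    stopped = true;
    handler.interrupt();
  }
}
{code}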
[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612432#comment-14612432 ] Lei Guo commented on YARN-3881: --- This is an interesting topic. Assuming the timeline server provides this support, should Ambari or other monitoring tools use this for monitoring purposes? If not, what's the scenario for writing RM-related metrics? > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single application, > these belong to the RM or are cluster-wide. Therefore, the current write path is > not going to work for these metrics because they don't have the associated > user/flow/app context info. We need to rethink the modeling of cross-app metrics > and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612429#comment-14612429 ] Zhijie Shen commented on YARN-3881: --- IMHO, we need to add an additional API to directly write the cross-app metrics (or already-aggregated metrics, if you think of these as the aggregated data of the individual apps, such as the counters of submitted/pending/running apps) to the backend, in separate tables, such as cluster/queue/user tables; these data don't need to be aggregated any more. > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single application, > these belong to the RM or are cluster-wide. Therefore, the current write path is > not going to work for these metrics because they don't have the associated > user/flow/app context info. We need to rethink the modeling of cross-app metrics > and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
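If such an additional API were added, it might take a shape like the following hypothetical writer interface; the method names, per-table split, and signatures are assumptions sketched for discussion, not part of any posted patch.
{code}
/**
 * Hypothetical writer API for cross-app / pre-aggregated metrics that carry
 * no user/flow/app context; each method targets its own backing table and
 * the written values need no further aggregation.
 */
public interface AggregatedMetricsWriter {
  /** Write a cluster-wide metric (e.g. running apps) to a cluster table. */
  void writeClusterMetric(String clusterId, String metricName,
      long timestamp, Number value);

  /** Write a queue-level metric (e.g. pending containers) to a queue table. */
  void writeQueueMetric(String clusterId, String queueName,
      String metricName, long timestamp, Number value);

  /** Write a user-level metric (e.g. apps submitted) to a user table. */
  void writeUserMetric(String clusterId, String user,
      String metricName, long timestamp, Number value);
}
{code}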
[jira] [Updated] (YARN-3881) Writing RM cluster-level metrics
[ https://issues.apache.org/jira/browse/YARN-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3881: -- Attachment: metrics.json > Writing RM cluster-level metrics > > > Key: YARN-3881 > URL: https://issues.apache.org/jira/browse/YARN-3881 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: metrics.json > > > RM has a bunch of metrics that we may want to write into the timeline > backend. I attached the metrics.json that I've crawled via > {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to > three groups of metrics: > 1. QueueMetrics > 2. JvmMetrics > 3. ClusterMetrics > The problem is that unlike other metrics, which belong to a single application, > these belong to the RM or are cluster-wide. Therefore, the current write path is > not going to work for these metrics because they don't have the associated > user/flow/app context info. We need to rethink the modeling of cross-app metrics > and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3881) Writing RM cluster-level metrics
Zhijie Shen created YARN-3881: - Summary: Writing RM cluster-level metrics Key: YARN-3881 URL: https://issues.apache.org/jira/browse/YARN-3881 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen RM has a bunch of metrics that we may want to write into the timeline backend. I attached the metrics.json that I've crawled via {{http://localhost:8088/jmx?qry=Hadoop:*}}. IMHO, we need to pay attention to three groups of metrics: 1. QueueMetrics 2. JvmMetrics 3. ClusterMetrics The problem is that unlike other metrics, which belong to a single application, these belong to the RM or are cluster-wide. Therefore, the current write path is not going to work for these metrics because they don't have the associated user/flow/app context info. We need to rethink the modeling of cross-app metrics and the API to handle them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
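For reference, a metrics dump like the attached metrics.json can be pulled from the RM's JMX servlet with nothing but the JDK; a minimal sketch of that crawl, assuming the RM web address used in the description:
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Fetches the RM JMX dump (QueueMetrics, JvmMetrics, ClusterMetrics, ...). */
public class JmxCrawl {
  public static void main(String[] args) throws Exception {
    // Same query as in the description; adjust host/port for your cluster.
    URL url = new URL("http://localhost:8088/jmx?qry=Hadoop:*");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // raw JSON, one big document
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}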
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612342#comment-14612342 ] Varun Saxena commented on YARN-3878: [~jianhe], the test case as such is adequate. I have basically added two assert statements. If the first statement is true and the second is false, a hang will occur. But as the assert statements are there, the test will fail before the hang occurs. Without the core changes of the patch, the test will fail at the second assertion point. But if you remove this second assertion point, the hang will occur and the test case will time out. {code} Assert.assertTrue("Event Queue should have been empty", eventQueue.isEmpty()); Assert.assertTrue("Async Dispatcher should have been drained as event " + "queue is empty", disp.isDrained()); {code} So do you want me to remove this second assertion statement so that the test case doesn't fail before the hang (without the core changes)? Let me know. > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. 
> *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$C
[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612320#comment-14612320 ] Jian He commented on YARN-3878: --- Hi [~varun_saxena], the test seems not adequate. It doesn't prove the AsyncDispatcher will hang in this case. Could you update the test case to simulate this scenario so that it actually hangs without the core changes of the patch? > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb
[jira] [Commented] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612307#comment-14612307 ] Varun Saxena commented on YARN-3047: Uploaded a new patch. [~sjlee0], [~zjshen], kindly review > [Data Serving] Set up ATS reader with basic request serving structure and > lifecycle > --- > > Key: YARN-3047 > URL: https://issues.apache.org/jira/browse/YARN-3047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Labels: BB2015-05-TBR > Attachments: Timeline_Reader(draft).pdf, > YARN-3047-YARN-2928.08.patch, YARN-3047-YARN-2928.09.patch, > YARN-3047-YARN-2928.10.patch, YARN-3047.001.patch, YARN-3047.003.patch, > YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, > YARN-3047.02.patch, YARN-3047.04.patch > > > Per design in YARN-2938, set up the ATS reader as a service and implement the > basic structure as a service. It includes lifecycle management, request > serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
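A minimal sketch of the kind of service skeleton under review here, assuming Hadoop's AbstractService lifecycle; the class name and the responsibilities named in the comments are illustrative, not the contents of the actual patch.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

/** Illustrative skeleton of an ATS reader service; not the posted patch. */
public class TimelineReaderSkeleton extends AbstractService {

  public TimelineReaderSkeleton() {
    super(TimelineReaderSkeleton.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Read reader-specific config: bind address, backing storage impl, etc.
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // Start the embedded web server that serves REST read requests.
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // Stop the web server and release backing storage reader resources.
    super.serviceStop();
  }
}
{code}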
[jira] [Updated] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3047: --- Attachment: YARN-3047-YARN-2928.10.patch > [Data Serving] Set up ATS reader with basic request serving structure and > lifecycle > --- > > Key: YARN-3047 > URL: https://issues.apache.org/jira/browse/YARN-3047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Labels: BB2015-05-TBR > Attachments: Timeline_Reader(draft).pdf, > YARN-3047-YARN-2928.08.patch, YARN-3047-YARN-2928.09.patch, > YARN-3047-YARN-2928.10.patch, YARN-3047.001.patch, YARN-3047.003.patch, > YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, > YARN-3047.02.patch, YARN-3047.04.patch > > > Per design in YARN-2938, set up the ATS reader as a service and implement the > basic structure as a service. It includes lifecycle management, request > serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3880) Writing more RM side app-level metrics
Zhijie Shen created YARN-3880: - Summary: Writing more RM side app-level metrics Key: YARN-3880 URL: https://issues.apache.org/jira/browse/YARN-3880 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen In YARN-3044, we implemented an analog of the metrics publisher for ATS v1. While it helps to write app/attempt/container life-cycle events, it doesn't write many of the app-level system metrics that the RM now has. Here are the metrics that I found missing: * runningContainers * memorySeconds * vcoreSeconds * preemptedResourceMB * preemptedResourceVCores * numNonAMContainerPreempted * numAMContainerPreempted Please feel free to add more to the list if you find something that's not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
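Among the listed metrics, memorySeconds and vcoreSeconds are resource-time integrals that the RM accumulates while containers run; a minimal sketch of that bookkeeping, with field and method names chosen here for illustration:
{code}
/** Illustrative accumulator for app-level resource-seconds metrics. */
public class AppResourceUsage {
  private long memorySeconds; // MB held by containers, integrated over time
  private long vcoreSeconds;  // vcores held by containers, integrated over time

  /** Called periodically (or on container finish) with the elapsed interval. */
  public void accumulate(int allocatedMB, int allocatedVcores,
      long intervalSeconds) {
    memorySeconds += (long) allocatedMB * intervalSeconds;
    vcoreSeconds += (long) allocatedVcores * intervalSeconds;
  }

  public long getMemorySeconds() { return memorySeconds; }
  public long getVcoreSeconds() { return vcoreSeconds; }
}
{code}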
[jira] [Updated] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3849: -- Attachment: 0004-YARN-3849.patch Yes [~leftnoteasy], you are correct, thanks for pointing that out. I updated the patch. :) > Too much preemption activity causing continuous killing of containers > across queues > - > > Key: YARN-3849 > URL: https://issues.apache.org/jira/browse/YARN-3849 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.0 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, > 0003-YARN-3849.patch, 0004-YARN-3849.patch > > > Two queues are used. Each queue is given a capacity of 0.5. Dominant > Resource policy is used. > 1. An app is submitted in QueueA which is consuming full cluster capacity > 2. After submitting an app in QueueB, there is some demand, invoking > preemption in QueueA > 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that > all containers other than the AM are getting killed in QueueA > 4. Now the app in QueueB is trying to take over the cluster with the current free > space. But there is some updated demand from the app in QueueA, which lost > its containers earlier, and preemption is now kicked off in QueueB. > The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the > apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612187#comment-14612187 ] Wangda Tan commented on YARN-3849: -- [~sunilg], Thanks for the update, but testPreemptionWithVCoreResource has a similar issue. {code} {"100:100", "10:100", "0"}, // used {code} Could you fix it as well? > Too much preemption activity causing continuous killing of containers > across queues > - > > Key: YARN-3849 > URL: https://issues.apache.org/jira/browse/YARN-3849 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.0 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, > 0003-YARN-3849.patch > > > Two queues are used. Each queue is given a capacity of 0.5. Dominant > Resource policy is used. > 1. An app is submitted in QueueA which is consuming full cluster capacity > 2. After submitting an app in QueueB, there is some demand, invoking > preemption in QueueA > 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that > all containers other than the AM are getting killed in QueueA > 4. Now the app in QueueB is trying to take over the cluster with the current free > space. But there is some updated demand from the app in QueueA, which lost > its containers earlier, and preemption is now kicked off in QueueB. > The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the > apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612159#comment-14612159 ] Sunil G commented on YARN-2004: --- Thank you [~jianhe] for the comments. - bq. Or this method has more responsibility than that? Yes. We are planning to check for ACLs (priority ACLs) in this method. I was planning to handle that in a separate ticket. {noformat} yarn.scheduler.capacity.root...acl=user1,user2 {noformat} This config will be at queue level, and we could restrict certain users to certain high priorities. So only certain users can use a high priority, and others won't be able to submit applications at that priority. This ACL check was planned to be added in {{authenticateApplicationPriority}}. - bq. we may merge the two into a single patch? I will merge these patches together and upload them to YARN-2003. > Priority scheduling support in Capacity scheduler > - > > Key: YARN-2004 > URL: https://issues.apache.org/jira/browse/YARN-2004 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, > 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, > 0006-YARN-2004.patch, 0007-YARN-2004.patch, 0008-YARN-2004.patch, > 0009-YARN-2004.patch, 0010-YARN-2004.patch > > > Based on the priority of the application, the Capacity Scheduler should be able > to give preference to applications while scheduling. > The comparator applicationComparator can be changed as below: > > 1. Check for application priority. If priority is available, then return > the highest-priority job. > 2. Otherwise continue with the existing logic, such as App ID comparison and > then timestamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
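To make the idea concrete, a queue-level priority ACL might eventually be configured along these lines; the full property name, the priority field in it, and the user list below are hypothetical, since the exact key and syntax were not finalized in this discussion.
{noformat}
# hypothetical capacity-scheduler configuration entry (syntax not finalized):
yarn.scheduler.capacity.root.default.priority.5.acl=user1,user2
# intent: only user1 and user2 may submit applications at priority 5 to
# root.default; submissions at that priority from other users are rejected.
{noformat}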
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612117#comment-14612117 ] Varun Saxena commented on YARN-3051: Ok...Will make the change > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051-YARN-2928.05.patch, YARN-3051-YARN-2928.06.patch, > YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, > YARN-3051.Reader_API_2.patch, YARN-3051.Reader_API_3.patch, > YARN-3051.Reader_API_4.patch, YARN-3051.wip.02.YARN-2928.patch, > YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612113#comment-14612113 ] Zhijie Shen commented on YARN-3051: --- 2. I meant we store it in a CSV file. Thoughts? 3. I think FS-impl-related config shouldn't be put in the API, as the impl is not supposed to be used publicly, but only for test purposes. > [Storage abstraction] Create backing storage read interface for ATS readers > --- > > Key: YARN-3051 > URL: https://issues.apache.org/jira/browse/YARN-3051 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Varun Saxena > Attachments: YARN-3051-YARN-2928.003.patch, > YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, > YARN-3051-YARN-2928.05.patch, YARN-3051-YARN-2928.06.patch, > YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, > YARN-3051.Reader_API_2.patch, YARN-3051.Reader_API_3.patch, > YARN-3051.Reader_API_4.patch, YARN-3051.wip.02.YARN-2928.patch, > YARN-3051.wip.patch, YARN-3051_temp.patch > > > Per design in YARN-2928, create backing storage read interface that can be > implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
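A test-only file-system store along the lines suggested above could be as small as the following sketch; the file name, column layout, and class name are assumptions for illustration, not the actual patch.
{code}
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Illustrative CSV-backed entity store for a test-only FS implementation. */
public class CsvEntityStore {
  private final Path file;

  public CsvEntityStore(String dir) {
    // One flat file for simplicity; a real impl might shard per app or type.
    this.file = Paths.get(dir, "entities.csv");
  }

  /** Appends one entity row: type,id,createdTime. */
  public void append(String entityType, String entityId, long createdTime)
      throws IOException {
    try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
      w.write(entityType + "," + entityId + "," + createdTime + "\n");
    }
  }
}
{code}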
[jira] [Updated] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3849: -- Attachment: 0003-YARN-3849.patch Thank you [~leftnoteasy] for the comments. Uploading a patch addressing the issues. Regarding one comment: bq. testPreemptionWithVCoreResource seems not correct, root.used != A.used + b.used {noformat} "root(=[100:200 100:200 100:200 100:200],x=[100:200 100:200 100:200 100:200]);" "-a(=[50:100 100:200 20:40 50:100],x=[50:100 100:200 80:160 50:100]);" + // a "-b(=[50:100 100:200 80:160 50:100],x=[50:100 100:200 20:40 50:100])"; {noformat} Here, now root.used = a.used + b.used. Please help check. > Too much preemption activity causing continuous killing of containers > across queues > - > > Key: YARN-3849 > URL: https://issues.apache.org/jira/browse/YARN-3849 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.0 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, > 0003-YARN-3849.patch > > > Two queues are used. Each queue is given a capacity of 0.5. Dominant > Resource policy is used. > 1. An app is submitted in QueueA which is consuming full cluster capacity > 2. After submitting an app in QueueB, there is some demand, invoking > preemption in QueueA > 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that > all containers other than the AM are getting killed in QueueA > 4. Now the app in QueueB is trying to take over the cluster with the current free > space. But there is some updated demand from the app in QueueA, which lost > its containers earlier, and preemption is now kicked off in QueueB. > The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the > apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
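Reading each bracketed tuple as [guaranteed, max, used, pending] (an assumption about the test's column convention), the used column now adds up for both the default partition and label x:
{noformat}
default partition:  a.used + b.used = 20:40  + 80:160 = 100:200 = root.used
label x:            a.used + b.used = 80:160 + 20:40  = 100:200 = root.used
{noformat}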
[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3877: --- Attachment: YARN-3877.01.patch > YarnClientImpl.submitApplication swallows exceptions > > > Key: YARN-3877 > URL: https://issues.apache.org/jira/browse/YARN-3877 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Affects Versions: 2.7.2 >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Minor > Attachments: YARN-3877.01.patch > > > When {{YarnClientImpl.submitApplication}} spins waiting for the application > to be accepted, any interruption during its sleep() calls is logged and > swallowed. > This makes it hard to interrupt the thread during shutdown. Really, it should > throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
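The usual remedy for such a polling loop is to stop swallowing the interrupt: restore the thread's interrupt status and surface an exception to the caller. A minimal sketch of that pattern (the loop body and the exception choice are illustrative, not YarnClientImpl's actual code):
{code}
import org.apache.hadoop.yarn.exceptions.YarnException;

/** Illustrative submit-and-wait loop that propagates interruption. */
public abstract class SubmitWaitSketch {
  public void waitUntilAccepted() throws YarnException {
    while (!isAccepted()) {
      try {
        Thread.sleep(200); // poll interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore interrupt status
        throw new YarnException(
            "Interrupted while waiting for the application to be accepted", e);
      }
    }
  }

  /** Placeholder for the real "has the app reached ACCEPTED" check. */
  protected abstract boolean isAccepted();
}
{code}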
[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3840: --- Attachment: YARN-3840-5.patch > Resource Manager web ui issue when sorting application by id (with > application having id > ) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: LINTE >Assignee: Mohammad Shahid Khan > Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, > YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > . > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3846) RM Web UI queue filter is not working
[ https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3846: --- Labels: PatchAvailable (was: ) Not adding any test case; the change is only in JS code. > RM Web UI queue filter is not working > - > > Key: YARN-3846 > URL: https://issues.apache.org/jira/browse/YARN-3846 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0, 2.8.0 >Reporter: Mohammad Shahid Khan >Assignee: Mohammad Shahid Khan > Labels: PatchAvailable > Attachments: YARN-3846.patch, scheduler queue issue.png, scheduler > queue positive behavior.png > > > Clicking on the root queue will show all applications, > but clicking on a leaf queue does not filter the applications related to the > clicked queue. > The regular expression seems to be wrong: > {code} > q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$';", > {code} > For example: > 1. Suppose the queue name is b. > Then the above expression will try to substr at index 1: > q.lastIndexOf(':') = -1 > -1 + 2 = 1 > which is wrong; it should look at index 0. > 2. If the queue name is ab.x, > then it will parse it to .x, > but it should be x. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
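The indexing problem is easy to reproduce with the same string operations; the RM UI code is JavaScript, but Java's lastIndexOf/substring behave identically here, so this sketch uses Java for consistency with the rest of this digest:
{code}
public class QueueFilterDemo {
  /** Buggy extraction: misbehaves when there is no ':' in the string. */
  static String buggy(String q) {
    return q.substring(q.lastIndexOf(':') + 2); // "b": index -1 + 2 = 1 -> ""
  }

  /** Fixed extraction: fall back to the whole string when ':' is absent. */
  static String fixed(String q) {
    int idx = q.lastIndexOf(':');
    return idx < 0 ? q : q.substring(idx + 2); // skip ": " after the label
  }

  public static void main(String[] args) {
    System.out.println(buggy("b")); // "" - wrong, the queue name is lost
    System.out.println(fixed("b")); // "b" - expected
  }
}
{code}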
[jira] [Updated] (YARN-3846) RM Web UI queue filter is not working
[ https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3846: --- Attachment: YARN-3846.patch Please review the attached patch > RM Web UI queue filter is not working > - > > Key: YARN-3846 > URL: https://issues.apache.org/jira/browse/YARN-3846 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0, 2.8.0 >Reporter: Mohammad Shahid Khan >Assignee: Mohammad Shahid Khan > Attachments: YARN-3846.patch, scheduler queue issue.png, scheduler > queue positive behavior.png > > > Clicking on the root queue will show all applications, > but clicking on a leaf queue does not filter the applications related to the > clicked queue. > The regular expression seems to be wrong: > {code} > q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$';", > {code} > For example: > 1. Suppose the queue name is b. > Then the above expression will try to substr at index 1: > q.lastIndexOf(':') = -1 > -1 + 2 = 1 > which is wrong; it should look at index 0. > 2. If the queue name is ab.x, > then it will parse it to .x, > but it should be x. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3840: --- Attachment: YARN-3840-4.patch Attached a patch with test cases > Resource Manager web ui issue when sorting application by id (with > application having id > ) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: LINTE >Assignee: Mohammad Shahid Khan > Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, > YARN-3840-3.patch, YARN-3840-4.patch > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > . > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3878: --- Attachment: YARN-3878.02.patch > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event 
handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.se
[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3878: --- Attachment: (was: YARN-3878.02.patch) > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > {noformat} > "AsyncDispatcher event handler" 
prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop
[jira] [Updated] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop
[ https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3878: --- Attachment: YARN-3878.02.patch [~jianhe], added a test case > AsyncDispatcher can hang while stopping if it is configured for draining > events on stop > --- > > Key: YARN-3878 > URL: https://issues.apache.org/jira/browse/YARN-3878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Varun Saxena >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-3878.01.patch, YARN-3878.02.patch > > > The sequence of events is as under : > # RM is stopped while putting a RMStateStore Event to RMStateStore's > AsyncDispatcher. This leads to an Interrupted Exception being thrown. > # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On > {{serviceStop}}, we will check if all events have been drained and wait for > event queue to drain(as RM State Store dispatcher is configured for queue to > drain on stop). > # This condition never becomes true and AsyncDispatcher keeps on waiting > incessantly for dispatcher event queue to drain till JVM exits. > *Initial exception while posting RM State store event to queue* > {noformat} > 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService > (AbstractService.java:enterState(452)) - Service: Dispatcher entered state > STOPPED > 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838) > {noformat} > *JStack of AsyncDispatcher hanging on stop* > 
{noformat} > "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e > waiting on condition [0x7fb9654e9000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000700b79250> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113) > at java.lang.Thread.run(Thread.java:744) > "main" prio=10 tid=0x7fb98000a800 nid=0x49c3 in Object.wait() > [0x7fb989851000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000700b79430> (a java.lang.Object) > at > org.apache.hadoop
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: YARN-2681.patch > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: cntic > Labels: BB2015-05-TBR > Fix For: 2.7.0 > > Attachments: HdfsTrafficControl_UML.png, Traffic Control Design.png, > YARN-2681.patch, YARN-2681.patch, YARN-2681.patch > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. HDFS reads can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be set on the container node > (HDFS read = incoming data for the container) > - Since the HDFS data node is handled by only one process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container-handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs in > order to enforce bandwidth of outgoing data > Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf > Implementation: http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
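Task 3 amounts to installing, on the datanode's egress interface, an HTB class plus a u32 filter per address:port pair; the sketch below shows how a node-side helper might shell out to tc. The device name, class ids, rate, and match direction are illustrative, and a root htb qdisc (handle 1:) is assumed to exist already.
{code}
import java.io.IOException;
import java.util.Arrays;

/** Illustrative helper installing a per-connection egress rate limit via tc. */
public class TcShaper {
  /** Limits egress traffic to dstIp:dstPort on 'dev' to 'rate' (e.g. "50mbit"). */
  public static void limit(String dev, String dstIp, int dstPort,
      String rate, String classId) throws IOException, InterruptedException {
    // HTB class that will carry this connection's traffic.
    run("tc", "class", "add", "dev", dev, "parent", "1:", "classid", classId,
        "htb", "rate", rate);
    // u32 filter steering the matching address:port pair into that class.
    run("tc", "filter", "add", "dev", dev, "protocol", "ip", "parent", "1:",
        "prio", "1", "u32",
        "match", "ip", "dst", dstIp + "/32",
        "match", "ip", "dport", String.valueOf(dstPort), "0xffff",
        "flowid", classId);
  }

  private static void run(String... cmd)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("Command failed: " + Arrays.toString(cmd));
    }
  }
}
{code}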
[jira] [Commented] (YARN-3508) Prevent processing preemption events on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611716#comment-14611716 ] Varun Saxena commented on YARN-3508: [~leftnoteasy], updated the patch for branch-2.7 > Prevent processing preemption events on the main RM dispatcher > -- > > Key: YARN-3508 > URL: https://issues.apache.org/jira/browse/YARN-3508 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-3508-branch-2.7.01.patch, YARN-3508.002.patch, > YARN-3508.01.patch, YARN-3508.03.patch, YARN-3508.04.patch, > YARN-3508.05.patch, YARN-3508.06.patch > > > We recently saw the RM for a large cluster lag far behind on the > AsyncDispatcher event queue. The AsyncDispatcher thread was consistently > blocked on the highly-contended CapacityScheduler lock trying to dispatch > preemption-related events for RMContainerPreemptEventDispatcher. Preemption > processing should occur on the scheduler event dispatcher thread or a > separate thread to avoid delaying the processing of other events in the > primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
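The remedy described in the summary is to take preemption events off the main dispatcher and handle them on their own thread; a minimal sketch of that routing pattern (class and thread names are illustrative, not the patch itself):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Illustrative router: preemption events run on a dedicated thread, so the
 * main RM dispatcher never blocks on the scheduler lock while dispatching.
 */
public class PreemptionEventRouter {
  private final ExecutorService preemptionThread =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "preemption-event-handler");
        t.setDaemon(true);
        return t;
      });

  /** Called from the main dispatcher thread; returns immediately. */
  public void handle(Runnable preemptionEvent) {
    preemptionThread.execute(preemptionEvent); // scheduler lock taken off-thread
  }

  public void stop() {
    preemptionThread.shutdownNow();
  }
}
{code}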
[jira] [Updated] (YARN-3508) Prevent processing preemption events on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3508: --- Attachment: YARN-3508-branch-2.7.01.patch > Prevent processing preemption events on the main RM dispatcher > -- > > Key: YARN-3508 > URL: https://issues.apache.org/jira/browse/YARN-3508 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-3508-branch-2.7.01.patch, YARN-3508.002.patch, > YARN-3508.01.patch, YARN-3508.03.patch, YARN-3508.04.patch, > YARN-3508.05.patch, YARN-3508.06.patch > > > We recently saw the RM for a large cluster lag far behind on the > AsyncDispatcher event queue. The AsyncDispatcher thread was consistently > blocked on the highly-contended CapacityScheduler lock trying to dispatch > preemption-related events for RMContainerPreemptEventDispatcher. Preemption > processing should occur on the scheduler event dispatcher thread or a > separate thread to avoid delaying the processing of other events in the > primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)