[jira] [Commented] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management
[ https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411357#comment-17411357 ] Weiwei Yang commented on YARN-10928:
Sure, granted the contributor role to [~Weihao Zheng]. Thanks

> Support default queue properties of capacity scheduler to simplify
> configuration management
> ---
>
> Key: YARN-10928
> URL: https://issues.apache.org/jira/browse/YARN-10928
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler
> Reporter: Weihao Zheng
> Assignee: Weihao Zheng
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> In practice, one user often owns many queues in an organization's cluster
> for different business usages. These queues often share the same properties,
> such as minimum-user-limit-percent and user-limit-factor. Users currently
> have to set each property on every queue they use if they want to customize
> these shared properties. Adding default queue properties for such cases will
> simplify the capacity scheduler's configuration file and make it easy to
> adjust the queues' common properties.
>
> CHANGES:
> Add two properties as queue defaults in the capacity scheduler's
> configuration:
> * {{yarn.scheduler.capacity.minimum-user-limit-percent}}
> * {{yarn.scheduler.capacity.user-limit-factor}}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
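A capacity-scheduler.xml sketch of how the proposed defaults could look (the two cluster-level keys are taken from the CHANGES section above; the queue name and values are hypothetical):

```xml
<configuration>
  <!-- Cluster-wide defaults proposed by YARN-10928: apply to every queue
       that does not set the property itself. -->
  <property>
    <name>yarn.scheduler.capacity.minimum-user-limit-percent</name>
    <value>25</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.user-limit-factor</name>
    <value>2</value>
  </property>
  <!-- A single queue can still override the default explicitly
       (queue name "root.analytics" is illustrative only). -->
  <property>
    <name>yarn.scheduler.capacity.root.analytics.user-limit-factor</name>
    <value>4</value>
  </property>
</configuration>
```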
[jira] [Assigned] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management
[ https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang reassigned YARN-10928:
Assignee: Weihao Zheng
[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059095#comment-17059095 ] Weiwei Yang commented on YARN-9050:
All done, marked resolved for 3.3.0. Thanks for the efforts, [~Tao Yang]!

> [Umbrella] Usability improvements for scheduler activities
> ---
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2018-11-23-16-46-38-138.png
>
> We have made some usability improvements for scheduler activities, based on
> YARN 3.1, in our cluster, as follows:
> 1. Not available for multi-threaded asynchronous scheduling. App and node
> activities may be confused when multiple scheduling threads record the
> activities of different allocation processes in the same variables, like
> appsAllocation and recordingNodesAllocation in ActivitiesManager. I think
> these variables should be thread-local to keep activities separated between
> threads.
> 2. Incomplete activities for the multi-node lookup mechanism:
> ActivitiesLogger skips recording through {{if (node == null || activitiesManager == null)}}
> when node is null, which indicates the allocation targets multiple nodes. We
> need to support recording activities for the multi-node lookup mechanism.
> 3. Current app activities cannot meet the requirements of diagnostics. For
> example, we can know that a node doesn't match a request, but it is hard to
> know why; especially when using placement constraints, it's difficult to
> make a detailed diagnosis manually. So I propose to improve the diagnoses of
> activities: add a diagnosis for the placement-constraints check, update the
> insufficient-resource diagnosis with detailed info (like 'insufficient
> resource names: [memory-mb]'), and so on.
> 4. Add more useful fields for app activities. In some scenarios we need to
> distinguish different requests but can't locate them based on the app
> activities info; other fields, such as allocation tags, can help to filter
> what we want. We have added containerPriority, allocationRequestId and
> allocationTags fields to AppAllocation.
> 5. Filter app activities by key fields. Sometimes the app activities results
> are massive and it's hard to find what we want. We have supported filtering
> by allocation tags to meet requirements from some apps; moreover, we can
> take container-priority and allocation-request-id as candidates if
> necessary.
> 6. Aggregate app activities by diagnoses. For a single allocation process,
> activities can still be massive in a large cluster. We frequently want to
> know why a request can't be allocated in the cluster, and it's hard to check
> every node manually, so aggregating app activities by diagnoses is necessary.
> We have added a groupingType parameter to the app-activities REST API for
> this, supporting grouping by diagnostics.
> I think we can have a discussion about these points; useful improvements
> that are accepted will be added to the patch. Thanks.
> The running design doc is attached
> [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5].
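Point 1 above argues the recording variables should be thread-local so concurrent scheduling threads never interleave their activity records. A minimal Java sketch of that idea (class and field names are illustrative, not the actual ActivitiesManager code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: per-thread recording buffers. Each scheduling
// thread appends to its own list, so parallel allocation processes cannot
// mix their records. Names are hypothetical, not real YARN fields.
public class PerThreadActivities {
    private static final ThreadLocal<List<String>> APPS_ALLOCATION =
            ThreadLocal.withInitial(ArrayList::new);

    public static void record(String activity) {
        APPS_ALLOCATION.get().add(activity);
    }

    // Return this thread's records and reset its buffer, so state does not
    // leak between scheduling cycles.
    public static List<String> drain() {
        List<String> recorded = APPS_ALLOCATION.get();
        APPS_ALLOCATION.remove();
        return recorded;
    }
}
```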
[jira] [Resolved] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang resolved YARN-9050.
Hadoop Flags: Reviewed
Resolution: Fixed
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059094#comment-17059094 ] Weiwei Yang commented on YARN-9567:
Thanks [~Tao Yang], looks good. +1

> Add diagnostics for outstanding resource requests on app attempts page
> ---
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacityscheduler
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch,
> YARN-9567.003.patch, YARN-9567.004.patch, app-activities-example.png,
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png,
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png,
> no_diagnostic_at_first.png, scheduler-activities-example.png,
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
> Currently we can see outstanding resource requests on the app attempt page;
> it would be helpful for users to understand why requests are pending if we
> could join this app's diagnostics with them.
> Discussed with [~cheersyang]: we can passively load diagnostics from the
> cache of completed app activities instead of actively triggering recording,
> which may bring uncontrollable risks.
> For example:
> (1) At first, if app activities have not been triggered, we see no
> diagnostic in the cache below the outstanding requests.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see
> diagnostics.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059090#comment-17059090 ] Weiwei Yang commented on YARN-9538:
All looking good now. +1.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: documentation
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch,
> YARN-9538.003.patch, YARN-9538.004.patch
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and
> ResourceManagerRest.md.
[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059089#comment-17059089 ] Weiwei Yang commented on YARN-9050:
Thanks [~brahmareddy]. The patch was good, but it was stuck there because of some Jenkins issues; let me trigger Jenkins once again. We'll try to get them merged before March 17. Thanks!
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029227#comment-17029227 ] Weiwei Yang commented on YARN-9538:
Manually triggered the Jenkins job: [https://builds.apache.org/job/PreCommit-YARN-Build/25491/]
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029226#comment-17029226 ] Weiwei Yang commented on YARN-9567:
Hi [~Tao Yang], can you please rebase the patch to the latest trunk?
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029223#comment-17029223 ] Weiwei Yang commented on YARN-9567:
Thanks. Looks good. Somehow the latest patch did not trigger the Jenkins job; manually triggered [https://builds.apache.org/job/PreCommit-YARN-Build/25490/]
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012075#comment-17012075 ] Weiwei Yang commented on YARN-9567:
> Not yet, I think it's not a strong requirement since it is only used for debugging; we rarely get a long table like that, and even if we do, it may have only a minor impact on the UI, right?

It may cause big usability issues when there are lots of requests. Can we add this support?
[jira] [Comment Edited] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012075#comment-17012075 ] Weiwei Yang edited comment on YARN-9567 at 1/9/20 5:50 PM:
> Not yet, I think it's not a strong requirement since it is only used for debugging; we rarely get a long table like that, and even if we do, it may have only a minor impact on the UI, right?

It may cause big usability issues when there are lots of requests. Can we add this support?
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012069#comment-17012069 ] Weiwei Yang commented on YARN-9538:
Hi [~Tao Yang], thanks for the updates. Could you please also check the failures in Jenkins:
# trailing spaces
# hadoop-yarn-site failed in the patch
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011477#comment-17011477 ] Weiwei Yang commented on YARN-9538:
Hi [~Tao Yang], a few comments.

CS:
# The newly added document should be added to the table of contents of the page.
# "Activities have been integrated into the application attempt page, should be shown below the table of outstanding requests when there is any outstanding request" -> "Activities info is available in the application attempt page on the RM Web UI, where outstanding requests are aggregated and displayed."

RM:
1. "The scheduler activities API currently supports Capacity Scheduler and provides a way to get scheduler activities in a single scheduling process, it will trigger recording scheduler activities in the next scheduling process and then take the last required scheduler activities from the cache as the response. The response has a hierarchical structure with multiple levels and important scheduling details which are organized by the sequence of the scheduling process" -> "The scheduler activities RESTful API can fetch scheduler activities info recorded in a scheduling cycle. The API returns a message that includes important scheduling activities info."
2. "nodeId - specified node ID, if not specified, scheduler will record next scheduling process on any node." -> "specified node ID; if not specified, the scheduler will record the scheduling activities info for the next scheduling cycle on all nodes."

For reference, the patched table under discussion:
{noformat}
+### Elements of the *Activities* object
+
+| Item | Data Type | Description |
+|:---- |:---- |:---- |
+| nodeId | string | The node ID on which scheduler tries to schedule containers. |
+| timestamp | long | Timestamp of the activities. |
+| dateTime | string | Date time of the activities. |
+| diagnostic | string | Top diagnostic of the activities about empty results, unavailable environments, or illegal input parameters, such as "waiting for display", "waiting for the next allocation", "Not Capacity Scheduler", "No node manager running in the cluster", "Got invalid groupBy: xx, valid groupBy types: DIAGNOSTICS" |
+| allocations | array of allocations | A collection of allocation objects. |
{noformat}

3. "The node ID on which scheduler tries to schedule containers." -> "The node ID on which the scheduler tries to allocate containers."
4. In the diagnostic row above, please remove "Not Capacity Scheduler".
5. Please replace all "ids" with "IDs".
6. "four node activities will be separated into two groups" -> "4 node activities info will be grouped into 2 groups."
7. "Application activities include useful scheduling info for a specified application, the response has a hierarchical structure with multiple levels" -> "the response has a hierarchical layout with the following fields:"
8. "**AppActivities** - AppActivities are root structure of application activities within basic information." -> is "AppActivities" the root element?
9. "+* **Applications** - Allocations are allocation attempts at app level queried from the cache." -> shouldn't this be "Applications"?
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011437#comment-17011437 ] Weiwei Yang commented on YARN-9567:
Hi [~Tao Yang], the screenshots look good. I am not a UI expert; I just want to make sure a few cases are covered by the patch:
# since this is a CS-only feature, please make sure nothing breaks when FS is enabled
# does the table support paging?
[jira] [Commented] (YARN-10042) Upgrade grpc-xxx dependencies to 1.26.0
[ https://issues.apache.org/jira/browse/YARN-10042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000540#comment-17000540 ] Weiwei Yang commented on YARN-10042: +1, it looks good. [~tangzhankun] could you please help to commit this change? Thanks > Upgrade grpc-xxx dependencies to 1.26.0 > -- > > Key: YARN-10042 > URL: https://issues.apache.org/jira/browse/YARN-10042 > Project: Hadoop YARN > Issue Type: Bug >Reporter: liusheng >Priority: Major > Attachments: YARN-10042.001.patch, > hadoop_build_aarch64_grpc_1.26.0.log, hadoop_build_x86_64_grpc_1.26.0.log, > yarn_csi_tests_aarch64_grpc_1.26.0.log, yarn_csi_tests_x86_64_grpc_1.26.0.log > > > For now, Hadoop YARN uses grpc-context, grpc-core, grpc-netty, grpc-protobuf, > grpc-protobuf-lite, grpc-stub and protoc-gen-grpc-java at version 1.15.1, but > "protoc-gen-grpc-java" is not supported on the aarch64 platform. The > grpc-java repo now supports the aarch64 platform and has released 1.26.0 to Maven > Central. > see: > [https://github.com/grpc/grpc-java/pull/6496] > [https://search.maven.org/search?q=g:io.grpc] > It is better to upgrade the grpc-xxx dependencies to version > 1.26.0. Both x86_64 and aarch64 servers build OK according to my > testing; please see the attachments: log of building on aarch64, log > of building on x86_64, log of running the yarn csi tests on aarch64, and log of > running the yarn csi tests on x86_64. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile
[ https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951984#comment-16951984 ] Weiwei Yang commented on YARN-8737: --- Hi [~Tao Yang] Change LGTM, could you please submit the patch? > Race condition in ParentQueue when reinitializing and sorting child queues in > the meanwhile > --- > > Key: YARN-8737 > URL: https://issues.apache.org/jira/browse/YARN-8737 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-8737.001.patch > > > An administrator raised an update for queues through the REST API; in the RM the parent queue > is refreshing its child queues by calling ParentQueue#reinitialize, > while, at the same time, async-scheduling threads are sorting the child queues in > ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may happen > and throw the exception below, because TimSort does not handle concurrent > modification of the objects it is sorting: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! 
> at java.util.TimSort.mergeHi(TimSort.java:899) > at java.util.TimSort.mergeAt(TimSort.java:516) > at java.util.TimSort.mergeCollapse(TimSort.java:441) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1454) > at java.util.Collections.sort(Collections.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962) > {noformat} > I think we can add a read-lock for > ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the > write-lock will be held when updating child queues 
in > ParentQueue#reinitialize. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
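The read-lock fix described above can be sketched as a minimal standalone model. This is a hypothetical illustration (the class and method names below are invented, not the actual ParentQueue code), assuming a ReentrantReadWriteLock: sorting takes the read lock and reinitialization takes the write lock, so TimSort never sees a concurrent mutation of the list it is comparing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for a parent queue's child list.
class ChildQueueList {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> childQueues = new ArrayList<>();

    // Mirrors ParentQueue#reinitialize: mutation holds the write lock.
    void reinitialize(List<String> newQueues) {
        lock.writeLock().lock();
        try {
            childQueues.clear();
            childQueues.addAll(newQueues);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Mirrors sortAndGetChildrenAllocationIterator: sorting holds the
    // read lock and works on a copy, so concurrent readers can proceed
    // while writers are excluded.
    List<String> sortedSnapshot(Comparator<String> cmp) {
        lock.readLock().lock();
        try {
            List<String> copy = new ArrayList<>(childQueues);
            copy.sort(cmp);
            return copy;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Sorting a copy under the read lock keeps the scheduler's hot path concurrent while still excluding the reinitialize writer.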
[jira] [Assigned] (YARN-9838) Using the CapacityScheduler, apply "movetoqueue" on an application which CS reserved containers for, will cause "Num Container" and "Used Resource" in ResourceUsage metrics error
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang reassigned YARN-9838: - Assignee: jiulongzhu > Using the CapacityScheduler, apply "movetoqueue" on an application which CS > reserved containers for, will cause "Num Container" and "Used Resource" in > ResourceUsage metrics error > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Assignee: jiulongzhu >Priority: Critical > Labels: patch > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch, YARN-9838.0002.patch > > > In some clusters of ours, we are seeing "Used Resource", "Used > Capacity", "Absolute Used Capacity" and "Num Container" go positive or > negative when the queue is absolutely idle (no RUNNING, no NEW apps...). In > extreme cases, apps couldn't be submitted to a queue that is actually idle > but whose "Used Resource" is far more than zero, just like a "Container Leak". > First, I found that "Used Resource", "Used Capacity" and "Absolute Used > Capacity" use the "Used" value of the ResourceUsage kept by AbstractCSQueue, and > "Num Container" uses the "numContainer" value kept by LeafQueue. > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource > change the state values of "numContainer" and "Used". Second, by comparing > how numContainer, ResourceUsageByLabel and QueueMetrics are > changed (#allocateContainer and #releaseContainer) for applications with > and without "movetoqueue", I found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue or the "used" value in > ResourceUsage when the application was moved from one queue to another. 
> The metric-value change logic for reservedContainers that are allocated, > then moved from the $FROM queue to the $TO queue, then released, is not > conservative: the resource is allocated from the $FROM queue but > released to the $TO queue. > ||move reservedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the > same, $TO queue stays the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stays the same, $TO queue stays the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > By contrast, the metric-value change logic for allocatedContainers > (allocated, acquired, running) across allocate, movetoqueue and > release is absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
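The bookkeeping problem described above boils down to a move that fails to touch both queues. A minimal, hypothetical sketch (invented names, not YARN's actual classes) of what conservative accounting looks like:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of conservative queue accounting: every unit added
// to one queue on a move must be removed from the other, including
// reserved containers, so totals cannot drift away from zero when the
// cluster is idle.
class QueueUsage {
    private final Map<String, Integer> numContainers = new HashMap<>();

    void allocate(String queue, int n) {
        numContainers.merge(queue, n, Integer::sum);
    }

    void release(String queue, int n) {
        numContainers.merge(queue, -n, Integer::sum);
    }

    // moveToQueue must touch BOTH queues; skipping reserved containers
    // here is exactly the leak described in the report.
    void moveToQueue(String from, String to, int n) {
        release(from, n);
        allocate(to, n);
    }

    int used(String queue) {
        return numContainers.getOrDefault(queue, 0);
    }
}
```

With this invariant, allocate in $FROM, move to $TO, then release in $TO nets out to zero in both queues.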
[jira] [Commented] (YARN-2255) YARN Audit logging not added to log4j.properties
[ https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931952#comment-16931952 ] Weiwei Yang commented on YARN-2255: --- I have committed this to trunk, 3.2 and 3.1. 3.0 is skipped because it is EOL according to [https://cwiki.apache.org/confluence/display/HADOOP/EOL+%28End-of-life%29+Release+Branches]. > YARN Audit logging not added to log4j.properties > > > Key: YARN-2255 > URL: https://issues.apache.org/jira/browse/YARN-2255 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Varun Saxena >Assignee: Aihua Xu >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-2255.1.patch, YARN-2255.patch > > > The log4j.properties file, which is part of the Hadoop package, doesn't have YARN > audit logging tied to it. This leads to audit logs being generated in the > normal log files. Audit logs should be generated in a separate log file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2255) YARN Audit logging not added to log4j.properties
[ https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931947#comment-16931947 ] Weiwei Yang commented on YARN-2255: --- Sorry for the late response [~aihuaxu], this slipped off my list. I am committing this now. > YARN Audit logging not added to log4j.properties > > > Key: YARN-2255 > URL: https://issues.apache.org/jira/browse/YARN-2255 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Varun Saxena >Assignee: Aihua Xu >Priority: Major > Attachments: YARN-2255.1.patch, YARN-2255.patch > > > The log4j.properties file, which is part of the Hadoop package, doesn't have YARN > audit logging tied to it. This leads to audit logs being generated in the > normal log files. Audit logs should be generated in a separate log file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2255) YARN Audit logging not added to log4j.properties
[ https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927261#comment-16927261 ] Weiwei Yang commented on YARN-2255: --- Thanks [~aihuaxu]. Patch LGTM. Thanks for putting it back on the table; let's get this fixed. I haven't tried this myself; I assume you have verified it works well in your env, correct? > YARN Audit logging not added to log4j.properties > > > Key: YARN-2255 > URL: https://issues.apache.org/jira/browse/YARN-2255 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Varun Saxena >Assignee: Aihua Xu >Priority: Major > Attachments: YARN-2255.1.patch, YARN-2255.patch > > > The log4j.properties file, which is part of the Hadoop package, doesn't have YARN > audit logging tied to it. This leads to audit logs being generated in the > normal log files. Audit logs should be generated in a separate log file. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922309#comment-16922309 ] Weiwei Yang commented on YARN-8995: --- Also looks good to me, [~Tao Yang], feel free to commit this. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918495#comment-16918495 ] Weiwei Yang commented on YARN-9538: --- Hi [~Tao Yang] Can you please create MD files based on the google doc, once you've done that, you can locally generate html files for preview. > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9050: -- Fix Version/s: 3.3.0 > [Umbrella] Usability improvements for scheduler activities > -- > > Key: YARN-9050 > URL: https://issues.apache.org/jira/browse/YARN-9050 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2018-11-23-16-46-38-138.png > > > We have made some usability improvements to scheduler activities, based on > YARN 3.1, in our cluster, as follows: > 1. Not available for multi-threaded asynchronous scheduling. App and node > activities may be confused when multiple scheduling threads record activities > of different allocation processes in the same variables, like appsAllocation > and recordingNodesAllocation in ActivitiesManager. I think these variables > should be thread-local to keep activities clear among multiple threads. > 2. Incomplete activities for the multi-node lookup mechanism, since > ActivitiesLogger skips recording through \{{if (node == null || > activitiesManager == null) }} when node is null, which represents an allocation > over multiple nodes. We need to support recording activities for the multi-node > lookup mechanism. > 3. Current app activities cannot meet the requirements of diagnostics; for > example, we can know that a node doesn't match a request but it is hard to know why, > especially when using placement constraints, where it's difficult to make a > detailed diagnosis manually. So I propose to improve the diagnoses in > activities: add a diagnosis for the placement-constraints check, update the > insufficient-resource diagnosis with detailed info (like 'insufficient > resource names:[memory-mb]') and so on. > 4. 
Add more useful fields to app activities. In some scenarios we need to > distinguish different requests but can't locate them based on the app > activities info alone; other fields, such as allocation tags, can help filter what > we want. We have added containerPriority, allocationRequestId > and allocationTags fields to AppAllocation. > 5. Filter app activities by key fields. Sometimes the results of app > activities are massive and it's hard to find what we want. We have supported > filtering by allocation-tags to meet requirements from some apps; moreover, we can > take container-priority and allocation-request-id as candidates if necessary. > 6. Aggregate app activities by diagnoses. For a single allocation process, > activities can still be massive in a large cluster. We frequently want to > know why a request can't be allocated in the cluster, and it's hard to check every node > manually, so aggregating app activities by > diagnoses is necessary. We have added a groupingType > parameter to the app-activities REST API for this, supporting grouping by > diagnostics. > I think we can have a discussion about these points; useful improvements that > are accepted will be added to the patch. Thanks. > The running design doc is attached > [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5]. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918472#comment-16918472 ] Weiwei Yang commented on YARN-9664: --- +1 committing now. Thanks [~Tao Yang] > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch, > YARN-9664.003.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918421#comment-16918421 ] Weiwei Yang commented on YARN-9664: --- UT seems not related to this patch, [~Tao Yang], could you please confirm? Other than that, I am +1 to v3 patch. > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch, > YARN-9664.003.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918260#comment-16918260 ] Weiwei Yang commented on YARN-9664: --- Thanks [~Tao Yang]. For the single placement node, that is confusing, I think we can just remove it. "Initial check: node has been removed from scheduler" "Initial check: node resource is insufficient for minimum allocation" "Queue skipped because node has been reserved" "Queue skipped because node resource is insufficient" and for the locality, "Node skipped because of no off-switch and locality violation" -> "Node skipped because node/rack locality cannot be satisfied" > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917798#comment-16917798 ] Weiwei Yang commented on YARN-9664: --- Hello [~Tao Yang] I just went through the patch. Most of the changes stay in activities package, that's good. *ActivitiesTestUtils* {code:java} getFirstSubNodeFromJson(JSONObject json, String... hierachicalFieldNames) {code} typo: {{hierachicalFieldNames}} -> {{hierarchicalFieldNames}} *ActivitiesUtils* Line 56: I noticed that the 1st filter is to filter out null objects, can we do this like following? {code:java} activityNodes.stream().filter(e -> Objects.nonNull(e))... {code} *ActivityDiagnosticConstant* {code:java} public final static String INIT_CHECK_SINGLE_NODE_REMOVED = "Initial check: " + "single placement node has been removed from scheduler"; {code} what does "single placement node" mean here? {code:java} public final static String QUEUE_SKIPPED_TO_RESPECT_FIFO = "Queue skipped " + "following applications in the queue to respect FIFO of applications"; {code} I think we can remove "in the queue", that seems redundant. {code:java} public final static String NODE_SKIPPED_BECAUSE_OF_NO_OFF_SWITCH_AND_LOCALITY_VIOLATION = "Node skipped because of no off-switch and locality violation"; {code} I am also not quite sure what does this mean, can you please elaborate? *ParentQueue* line 650: is it safe to the check: "if (node != null && !isReserved)" here? Thanks > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. 
This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
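The null-filter suggestion in the review comment above can be written even more compactly with a method reference. A small self-contained sketch (the helper name is illustrative, not the actual ActivitiesUtils code):

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class NullFilterSketch {
    // Drop null entries before further processing; Objects::nonNull is
    // equivalent to the lambda e -> Objects.nonNull(e) from the comment.
    static List<String> nonNullOnly(List<String> activityNodes) {
        return activityNodes.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }
}
```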
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917421#comment-16917421 ] Weiwei Yang commented on YARN-9664: --- Just got time to take a look at this. Wow, this is a huge set of changes. [~Tao Yang], have you verified the output of these changes are expected? I will try to go through the changes and hopefully, I can contribute enough review comments. Thx > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911501#comment-16911501 ] Weiwei Yang edited comment on YARN-8995 at 8/20/19 4:08 PM: Hi [~zhuqi]/[~Tao Yang] Thanks for working on this. Patch LGTM, I might be just a little picky on the configuration name, right now it is not straightforward to me. "The interval of queue size (in thousands) for printing the boom queue event type details." How about something like the following for the description, if I understand this correctly: "The threshold used to trigger the logging of event types and counts in RM's main event dispatcher. Default length is 5000, which means RM will print events info when the queue size cumulatively reaches 5000 every time. Such info can be used to reveal what kind of events that RM is stuck at processing mostly, it can help to narrow down certain performance issues." And also, the config name is better to be something like {{yarn.dispatcher.print-events-info.threshold}}, you don't need to use in-thousands here, as several thousand is still human-readable. Does that make sense? Thanks was (Author: cheersyang): Hi [~zhuqi]/[~Tao Yang] Thanks for working on this. Patch LGTM, I might be just a little picky on the configuration name, right now it is not straightforward to me. {noformat} The interval of queue size (in thousands) for printing the boom queue event type details. {noformat} How about something like the following for the description, if I understand this correctly: {noformat} The threshold used to trigger the logging of event types and counts in RM's main event dispatcher. Default length is 5000, which means RM will print events info when the queue size cumulatively reaches 5000 every time. Such info can be used to reveal what kind of events that RM is stuck at processing mostly, it can help to narrow down certain performance issues. 
{noformat} And also, the config name is better to be something like {{yarn.dispatcher.print-events-info.threshold}}, you don't need to use in-thousands here, as several thousand is still human-readable. Does that make sense? Thanks > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911501#comment-16911501 ] Weiwei Yang commented on YARN-8995: --- Hi [~zhuqi]/[~Tao Yang] Thanks for working on this. Patch LGTM, I might be just a little picky on the configuration name, right now it is not straightforward to me. {noformat} The interval of queue size (in thousands) for printing the boom queue event type details. {noformat} How about something like the following for the description, if I understand this correctly: {noformat} The threshold used to trigger the logging of event types and counts in RM's main event dispatcher. Default length is 5000, which means RM will print events info when the queue size cumulatively reaches 5000 every time. Such info can be used to reveal what kind of events that RM is stuck at processing mostly, it can help to narrow down certain performance issues. {noformat} And also, the config name is better to be something like {{yarn.dispatcher.print-events-info.threshold}}, you don't need to use in-thousands here, as several thousand is still human-readable. Does that make sense? Thanks > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . 
I think it's necessary to > log the event types when the event queue grows too big, and add the information > to the metrics; the queue-size threshold is a parameter which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
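The behaviour being discussed — dump a count of queued event types every time the queue size crosses another multiple of the threshold — could look roughly like the following hypothetical sketch (invented names; the real AsyncDispatcher change lives in the patches attached to the issue):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: once the pending queue grows past a multiple of
// the configured threshold, snapshot a count of queued event types so
// operators can see what the dispatcher is stuck on.
class EventQueueMonitor {
    private final int threshold;   // e.g. 5000 in the discussion above
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private long nextReportAt;

    EventQueueMonitor(int threshold) {
        this.threshold = threshold;
        this.nextReportAt = threshold;
    }

    // Returns a snapshot of event-type counts when the threshold is
    // crossed, otherwise null. Note ConcurrentLinkedQueue.size() is O(n),
    // which is acceptable for an occasional diagnostic pass.
    Map<String, Long> offer(String eventType) {
        queue.add(eventType);
        if (queue.size() >= nextReportAt) {
            nextReportAt += threshold;
            Map<String, Long> counts = new HashMap<>();
            for (String e : queue) {
                counts.merge(e, 1L, Long::sum);
            }
            return counts;
        }
        return null;
    }
}
```

A real dispatcher would log the returned snapshot and publish it to metrics rather than hand it back to the caller.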
[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics
[ https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909036#comment-16909036 ] Weiwei Yang commented on YARN-2599: --- LGTM, that minor check-style issue can be fixed during commit. +1. > Standby RM should also expose some jmx and metrics > -- > > Key: YARN-2599 > URL: https://issues.apache.org/jira/browse/YARN-2599 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1 >Reporter: Karthik Kambatla >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-2599.002.patch, YARN-2599.patch > > > YARN-1898 redirects jmx and metrics to the Active. As discussed there, we > need to separate out metrics displayed so the Standby RM can also be > monitored. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9733) Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when subprocess of container dead
[ https://issues.apache.org/jira/browse/YARN-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903612#comment-16903612 ] Weiwei Yang commented on YARN-9733: --- Sure, added you as a contributor and assigned this issue to you. Thx. > Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when > subprocess of container dead > > > Key: YARN-9733 > URL: https://issues.apache.org/jira/browse/YARN-9733 > Project: Hadoop YARN > Issue Type: Bug >Reporter: qian han >Assignee: qian han >Priority: Major > > The method getTotalProcessJiffies only gets jiffies for running processes, not > dead processes. > For example, take process pid100 and its children pid200 and pid300. > We call getCpuUsagePercent the first time; assume that pid100 has 1000 jiffies, > pid200 2000 and pid300 3000. The totalProcessJiffies1 is 6000. > Then we kill pid300 and call getCpuUsagePercent the second time; assume > that pid100 has 1100 jiffies, pid200 2200. The totalProcessJiffies2 is 3300. > So we get a CPU usage percent of 0. > I would like to fix this bug. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9733) Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when subprocess of container dead
[ https://issues.apache.org/jira/browse/YARN-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang reassigned YARN-9733: - Assignee: qian han > Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when > subprocess of container dead > > > Key: YARN-9733 > URL: https://issues.apache.org/jira/browse/YARN-9733 > Project: Hadoop YARN > Issue Type: Bug >Reporter: qian han >Assignee: qian han >Priority: Major > > The method getTotalProcessJiffies only gets jiffies for running processes, not > dead processes. > For example, take process pid100 and its children pid200 and pid300. > We call getCpuUsagePercent the first time; assume that pid100 has 1000 jiffies, > pid200 2000 and pid300 3000. The totalProcessJiffies1 is 6000. > Then we kill pid300 and call getCpuUsagePercent the second time; assume > that pid100 has 1100 jiffies, pid200 2200. The totalProcessJiffies2 is 3300. > So we get a CPU usage percent of 0. > I would like to fix this bug. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
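The arithmetic in this report can be reproduced with a small standalone sketch (hypothetical helpers, not the actual ProcfsBasedProcessTree code): the jiffies total is summed over live processes only, so when a child dies the second sample can be smaller than the first, and the resulting negative delta is clamped to a CPU usage of 0.

```java
import java.util.Map;

// Sketch of the bug described above. Jiffies are summed over currently
// live processes only, so when a child process dies the new total can be
// smaller than the old one; a negative delta then reads as 0% CPU.
public class CpuUsageSketch {
    /** Sums jiffies over the processes that are still alive at sample time. */
    public static long totalJiffies(Map<String, Long> liveProcessJiffies) {
        return liveProcessJiffies.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Percent of one core used between two samples; negative deltas clamp to 0. */
    public static float cpuUsagePercent(long jiffies1, long jiffies2,
                                        long elapsedJiffies) {
        long delta = jiffies2 - jiffies1;
        if (delta < 0) {
            delta = 0; // dead children shrink the sum, so usage reads as idle
        }
        return 100f * delta / elapsedJiffies;
    }
}
```

Plugging in the report's numbers: the first sample sums to 6000, the second to 3300 after pid300 is killed, so the delta is -2700 and the reported CPU usage is 0.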
[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893856#comment-16893856 ] Weiwei Yang commented on YARN-7621: --- [~cane], could you please help review [~Tao Yang]'s patch? Just want to cross-check. Thanks > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path but CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler to make the interface clearer and > the scheduler switch smooth. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7621: -- Issue Type: Sub-task (was: Improvement) Parent: YARN-9698 > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path but CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler to make the interface clearer and > the scheduler switch smooth. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9698: -- Labels: fs2cs (was: ) > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > > We see some users who want to migrate from Fair Scheduler to Capacity Scheduler. > This Jira is created as an umbrella to track all related efforts for the > migration; the scope contains: > * Bug fixes > * Add missing features > * Migration tools that help to generate CS configs based on FS configs, validate > configs, etc. > * Documentation > This is part of the CS component; the purpose is to make the migration process > smooth. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
Weiwei Yang created YARN-9698: - Summary: [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler Key: YARN-9698 URL: https://issues.apache.org/jira/browse/YARN-9698 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Weiwei Yang We see some users who want to migrate from Fair Scheduler to Capacity Scheduler. This Jira is created as an umbrella to track all related efforts for the migration; the scope contains: * Bug fixes * Add missing features * Migration tools that help to generate CS configs based on FS configs, validate configs, etc. * Documentation This is part of the CS component; the purpose is to make the migration process smooth. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890170#comment-16890170 ] Weiwei Yang commented on YARN-9687: --- This is related to the core resource calculator, so it would be good to have [~sunilg] take a look too. [~sunilg], could you please help review this? Thx > Queue headroom check may let unacceptable allocation off when using > DominantResourceCalculator > -- > > Key: YARN-9687 > URL: https://issues.apache.org/jira/browse/YARN-9687 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9687.001.patch > > > Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} > uses {{Resources#greaterThanOrEqual}}, which internally compares resources > by ratio; when using DominantResourceCalculator, it may let unacceptable > allocations off in some scenarios. > For example: > cluster-resource=<10GB, 10 vcores> > queue-headroom=<2GB, 4 vcores> > required-resource=<3GB, 1 vcore> > Here the headroom ratio (0.4) is greater than the required ratio (0.3), so > allocations will be let off in the scheduling process but will always be > rejected when committing these proposals. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
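The example in the report can be checked numerically with a simplified two-resource sketch (illustrative methods, not the real Resources/DominantResourceCalculator API): the headroom's dominant share (4/10 = 0.4) exceeds the request's (3/10 = 0.3), so a ratio-based comparison lets the allocation off even though the 3GB memory request can never fit into the 2GB headroom.

```java
// Simplified two-resource sketch of the comparison discussed above.
// dominantShare mimics how a dominant-resource calculator ranks resources
// by their largest share of the cluster; fitsIn is the component-wise check
// that the commit phase effectively enforces.
public class HeadroomSketch {
    public static double dominantShare(long mem, long vcores,
                                       long clusterMem, long clusterVcores) {
        return Math.max((double) mem / clusterMem, (double) vcores / clusterVcores);
    }

    /** Ratio-based check: passes if headroom's dominant share >= request's. */
    public static boolean ratioCheck(long hMem, long hVc, long rMem, long rVc,
                                     long cMem, long cVc) {
        return dominantShare(hMem, hVc, cMem, cVc) >= dominantShare(rMem, rVc, cMem, cVc);
    }

    /** Component-wise check: every resource of the request must fit. */
    public static boolean fitsIn(long hMem, long hVc, long rMem, long rVc) {
        return rMem <= hMem && rVc <= hVc;
    }
}
```

For headroom <2GB, 4 vcores> and request <3GB, 1 vcore> in a <10GB, 10 vcores> cluster, the ratio check passes while the component-wise check fails, which is exactly the accept-then-reject mismatch the report describes.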
[jira] [Commented] (YARN-9682) Wrong log message when finalizing the upgrade
[ https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886647#comment-16886647 ] Weiwei Yang commented on YARN-9682: --- Pushed to trunk, cherry-picked to branch-3.2 and branch-3.1. Thanks for the contribution [~kyungwan nam]. > Wrong log message when finalizing the upgrade > - > > Key: YARN-9682 > URL: https://issues.apache.org/jira/browse/YARN-9682 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Trivial > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9682.001.patch > > > I've seen the following wrong message when running finalize-upgrade for a > yarn-service > {code:java} > 2019-07-16 17:44:09,204 INFO client.ServiceClient > (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} > upgrade{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9682) Wrong log message when finalizing the upgrade
[ https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9682: -- Fix Version/s: 3.1.3 > Wrong log message when finalizing the upgrade > - > > Key: YARN-9682 > URL: https://issues.apache.org/jira/browse/YARN-9682 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Trivial > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9682.001.patch > > > I've seen the wrong message as follows when finalize-upgrade for a > yarn-service > {code:java} > 2019-07-16 17:44:09,204 INFO client.ServiceClient > (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} > upgrade{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9682) Wrong log message when finalizing the upgrade
[ https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9682: -- Fix Version/s: 3.2.1 > Wrong log message when finalizing the upgrade > - > > Key: YARN-9682 > URL: https://issues.apache.org/jira/browse/YARN-9682 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Trivial > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9682.001.patch > > > I've seen the wrong message as follows when finalize-upgrade for a > yarn-service > {code:java} > 2019-07-16 17:44:09,204 INFO client.ServiceClient > (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} > upgrade{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9682) Wrong log message when finalizing the upgrade
[ https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9682: -- Summary: Wrong log message when finalizing the upgrade (was: wrong log message when finalize upgrade) > Wrong log message when finalizing the upgrade > - > > Key: YARN-9682 > URL: https://issues.apache.org/jira/browse/YARN-9682 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Trivial > Attachments: YARN-9682.001.patch > > > I've seen the wrong message as follows when finalize-upgrade for a > yarn-service > {code:java} > 2019-07-16 17:44:09,204 INFO client.ServiceClient > (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} > upgrade{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9682) wrong log message when finalize upgrade
[ https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886630#comment-16886630 ] Weiwei Yang commented on YARN-9682: --- +1, committing shortly. > wrong log message when finalize upgrade > --- > > Key: YARN-9682 > URL: https://issues.apache.org/jira/browse/YARN-9682 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Trivial > Attachments: YARN-9682.001.patch > > > I've seen the wrong message as follows when finalize-upgrade for a > yarn-service > {code:java} > 2019-07-16 17:44:09,204 INFO client.ServiceClient > (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} > upgrade{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881876#comment-16881876 ] Weiwei Yang commented on YARN-9538: --- Hi [~Tao Yang] I just read the v3 doc, it looks pretty good now. I have added my comments in the doc directly, please take a look. Thanks > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877465#comment-16877465 ] Weiwei Yang commented on YARN-7621: --- cc [~leftnoteasy], [~sunilg], [~wilfreds]. This issue is important for users who want to migrate from FS to CS. Adding a label to tag it. > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path but CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler to make the interface clearer and > the scheduler switch smooth. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7621: -- Priority: Major (was: Minor) > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path but CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler to make the interface clearer and > the scheduler switch smooth. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7621: -- Labels: fs2cs (was: ) > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path but CapacityScheduler needs the queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but this makes it hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with a queue path for CapacityScheduler to make the interface clearer and > the scheduler switch smooth. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9658) Fix UT failures in TestLeafQueue
[ https://issues.apache.org/jira/browse/YARN-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9658: -- Summary: Fix UT failures in TestLeafQueue (was: UT failures in TestLeafQueue) > Fix UT failures in TestLeafQueue > > > Key: YARN-9658 > URL: https://issues.apache.org/jira/browse/YARN-9658 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9658.001.patch > > > In ActivitiesManager, if there's no yarn configuration in the mock RMContext, the > cleanup interval can't be initialized to its default of 5 seconds, causing the > cleanup thread to keep running repeatedly without any interval, which may bring > problems to the mockito framework; it caused an OOM in this case, as internally many > throwable objects were generated by incomplete mocks. > Add the configuration to the mock RMContext to fix the failures in TestLeafQueue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9658) UT failures in TestLeafQueue
[ https://issues.apache.org/jira/browse/YARN-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877464#comment-16877464 ] Weiwei Yang commented on YARN-9658: --- +1 > UT failures in TestLeafQueue > > > Key: YARN-9658 > URL: https://issues.apache.org/jira/browse/YARN-9658 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9658.001.patch > > > In ActivitiesManager, if there's no yarn configuration in the mock RMContext, the > cleanup interval can't be initialized to its default of 5 seconds, causing the > cleanup thread to keep running repeatedly without any interval, which may bring > problems to the mockito framework; it caused an OOM in this case, as internally many > throwable objects were generated by incomplete mocks. > Add the configuration to the mock RMContext to fix the failures in TestLeafQueue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9655: -- Fix Version/s: 2.9.3 > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Fix For: 3.0.4, 3.3.0, 3.2.1, 2.9.3, 3.1.3 > > Attachments: YARN-9655.branch-2.9.patch, YARN-9655.branch-3.0.patch > > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9655: -- Fix Version/s: 3.0.4 > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9655.branch-2.9.patch, YARN-9655.branch-3.0.patch > > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876632#comment-16876632 ] Weiwei Yang commented on YARN-9655: --- Thanks [~hunhun], re-opened the issue to trigger jenkins job. > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9655.branch-2.9.patch > > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang reopened YARN-9655: --- > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9655.branch-2.9.patch > > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876620#comment-16876620 ] Weiwei Yang commented on YARN-9655: --- I just pushed this to trunk, cherry-picked to branch-3.2 and branch-3.1. Thanks for the contribution [~hunhun]. FederationInterceptor was added in 2.9; does this issue also exist in branch-2.9 and branch-3.0? If it does, then we need to provide a patch for branch-2.9, branch-2 and branch-3.0. [~hunhun], please let me know, thanks. > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > > In YARN Federation mode using FederationInterceptor, when submitting an > application, the AM will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that the applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9655: -- Fix Version/s: 3.1.3 > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9655: -- Fix Version/s: 3.2.1
[jira] [Resolved] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang resolved YARN-9655. --- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.3.0
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875986#comment-16875986 ] Weiwei Yang commented on YARN-9623: --- Hi [~Tao Yang], please create a new issue to fix this failure. Thanks
> Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-9623
> URL: https://issues.apache.org/jira/browse/YARN-9623
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9623.001.patch, YARN-9623.002.patch
>
> Currently we can use the configuration entry "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to control the max queue length of app activities, but in some scenarios this configuration may need to be updated as a cluster grows. Moreover, it is better for users to be able to ignore that conf, so it should be auto-adjusted internally.
> There are some differences among the scheduling modes:
> * multi-node placement disabled
> ** Heartbeat-driven scheduling: the max queue length of app activities should not be less than the number of nodes; considering that nodes cannot always be in order, we should leave some room for misordering, for example by guaranteeing that the max queue length is not less than 1.2 * numNodes.
> ** Async scheduling: every async scheduling thread goes through all nodes in order, so in this mode we should guarantee that the max queue length is numThreads * numNodes.
> * multi-node placement enabled: activities on all nodes can be involved in a single app allocation, therefore there is no need to adjust for this mode.
> To sum up, we can adjust the max queue length of app activities like this:
> {code}
> int configuredMaxQueueLength;
> int maxQueueLength;
> serviceInit() {
>   ...
>   configuredMaxQueueLength = ...; // read configured max queue length
>   maxQueueLength = configuredMaxQueueLength; // take configured value as default
> }
> CleanupThread#run() {
>   ...
>   if (multiNodeDisabled) {
>     if (asyncSchedulingEnabled) {
>       maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes);
>     } else {
>       maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes);
>     }
>   } else if (maxQueueLength != configuredMaxQueueLength) {
>     maxQueueLength = configuredMaxQueueLength;
>   }
> }
> {code}
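For concreteness, the adjustment rule in the pseudocode above can be sketched as a small pure function. This is an illustrative sketch only; the names mirror the issue's pseudocode, not the actual ActivitiesManager fields:

```java
// Sketch of the proposed auto-adjustment of the app-activities queue length.
public class AppActivitiesQueueLength {
    static int adjust(int configuredMaxQueueLength,
                      int numNodes,
                      int numSchedulingThreads,
                      boolean multiNodeEnabled,
                      boolean asyncSchedulingEnabled) {
        if (multiNodeEnabled) {
            // Multi-node placement: all nodes can appear in a single app
            // allocation, so the configured value is used unchanged.
            return configuredMaxQueueLength;
        }
        if (asyncSchedulingEnabled) {
            // Each async scheduling thread walks every node in order.
            return Math.max(configuredMaxQueueLength,
                            numSchedulingThreads * numNodes);
        }
        // Heartbeat-driven: leave ~20% headroom for out-of-order heartbeats.
        return Math.max(configuredMaxQueueLength, (int) (1.2 * numNodes));
    }
}
```

In async mode on a 2000-node cluster with 4 threads this yields 8000 even if the configured value is lower, which is the growth behavior the description asks for.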
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875037#comment-16875037 ] Weiwei Yang commented on YARN-9655: --- Oops. The fix was simple, which made me overlook that there is no UT for it; let me revert the commit for now. [~hunhun], can you help to add a UT to cover this NPE issue? Thanks
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875035#comment-16875035 ] Weiwei Yang commented on YARN-9655: --- +1. Committing shortly.
[jira] [Assigned] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang reassigned YARN-9655: - Assignee: hunshenshi
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874994#comment-16874994 ] Weiwei Yang commented on YARN-9623: --- Pushed to trunk, thanks for the contribution [~Tao Yang].
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874991#comment-16874991 ] Weiwei Yang commented on YARN-9623: --- +1, committing shortly.
[jira] [Commented] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874571#comment-16874571 ] Weiwei Yang commented on YARN-6629: --- Hi [~aihuaxu], that's correct, this will be included in 2.10. But if you need it in the next 2.9.x release, then we need to backport to branch-2.9. > NPE occurred when container allocation proposal is applied but its resource > requests are removed before > --- > > Key: YARN-6629 > URL: https://issues.apache.org/jira/browse/YARN-6629 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Fix For: 3.1.0, 2.10.0 > > Attachments: YARN-6629.001.patch, YARN-6629.002.patch, > YARN-6629.003.patch, YARN-6629.004.patch, YARN-6629.005.patch, > YARN-6629.006.patch, YARN-6629.branch-2.001.patch > > > I wrote a test case to reproduce another problem for branch-2 and found new > NPE error, log: > {code} > FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) > at > org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) > at > org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) > at org.mockito.internal.MockHandler.handle(MockHandler.java:97) > at > org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143)
> at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Reproduce this error in chronological order:
> 1. AM started and requested 1 container with schedulerRequestKey#1: ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests (added schedulerRequestKey#1 into schedulerKeyToPlacementSets)
> 2. Scheduler allocated 1 container for this request and accepted the proposal
> 3. AM removed this request: ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources (removed schedulerRequestKey#1 from schedulerKeyToPlacementSets)
> 4. Scheduler applied this proposal: CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate, which throws an NPE when calling schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node);
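The four steps above describe a remove-then-apply race: the placement set for a scheduler key can be removed between a proposal being accepted and being applied. A minimal hedged sketch of the defensive pattern (simplified stand-in types and names; the actual fix in AppSchedulingInfo is more involved):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of null-checking the placement-set lookup so a stale
// proposal is rejected instead of triggering the NPE shown above.
public class AllocateRaceSketch {
    static final Map<Integer, String> schedulerKeyToPlacementSets =
        new ConcurrentHashMap<>();

    // Returns false (reject the proposal) when the key was removed.
    static boolean allocate(int schedulerRequestKey) {
        String placementSet =
            schedulerKeyToPlacementSets.get(schedulerRequestKey);
        if (placementSet == null) {
            // Request was cancelled by the AM between accept and apply.
            return false;
        }
        // ... would call placementSet.allocate(...) here ...
        return true;
    }
}
```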
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874185#comment-16874185 ] Weiwei Yang commented on YARN-9655: --- LGTM. Not sure if we can get some folks familiar with this area to review. [~hunhun], can you fix the checkstyle issue? It's simple: you just need to keep lines under 80 characters. Thanks
[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby
[ https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874181#comment-16874181 ] Weiwei Yang commented on YARN-9642: --- Sorry for getting to this late. It's a good catch, +1. Thanks [~bibinchundatt], [~sunilg].
> AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby
> -----------------------------------------------------------------------------------------
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: YARN-9642.001.patch, image-2019-06-22-16-05-24-114.png
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast switchover, too. AbstractYarnScheduler should make sure the scheduled Timer is cancelled on serviceStop. This also causes a memory leak:
> !image-2019-06-22-16-05-24-114.png!
[jira] [Commented] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874176#comment-16874176 ] Weiwei Yang commented on YARN-6629: --- hi [~aihuaxu], [~Tao Yang] Feel free to create another Jira for the backport, loop me in and I'll help to review/commit. Thanks.
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874168#comment-16874168 ] Weiwei Yang commented on YARN-9623: --- Hi [~Tao Yang] OK, I am fine with that. However, we still need the configuration {{yarn.resourcemanager.activities-manager.app-activities.max-queue-length}} to be there. If this configuration is set, then the value should be enforced for the queue size and the auto-adjustment disabled. Can you add that logic? This is to ensure we have a workaround if the auto-calculation is suboptimal. Hope that makes sense. Thanks
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873966#comment-16873966 ] Weiwei Yang commented on YARN-9623: --- Hi [~Tao Yang] Generally, I think this is a good approach, to have fewer configs. However, the activity manager should be a general service, it should not be depending on CS's configuration, for example, the number of async threads. How about to let it just be {{1.2 * numOfNodes}} for both cases and see how this works? We can continue to tune this after we have more experience to use this in real clusters. Another thing is {{appActivitiesMaxQueueLength}}, do we need to make it atomic because it is being modified in another thread. Thanks > Auto adjust max queue length of app activities to make sure activities on all > nodes can be covered > -- > > Key: YARN-9623 > URL: https://issues.apache.org/jira/browse/YARN-9623 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9623.001.patch > > > Currently we can use configuration entry > "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to > control max queue length of app activities, but in some scenarios , this > configuration may need to be updated in a growing cluster. Moreover, it's > better for users to ignore that conf therefor it should be auto adjusted > internally. > There are some differences among different scheduling modes: > * multi-node placement disabled > ** Heartbeat driven scheduling: max queue length of app activities should > not less than the number of nodes, considering nodes can not be always in > order, we should make some room for misorder, for example, we can guarantee > that max queue length should not be less than 1.2 * numNodes > ** Async scheduling: every async scheduling thread goes through all nodes in > order, in this mode, we should guarantee that max queue length should be > numThreads * numNodes. 
> * multi-node placement enabled: activities on all nodes can be involved in a > single app allocation, therefor there's no need to adjust for this mode. > To sum up, we can adjust the max queue length of app activities like this: > {code} > int configuredMaxQueueLength; > int maxQueueLength; > serviceInit(){ > ... > configuredMaxQueueLength = ...; //read configured max queue length > maxQueueLength = configuredMaxQueueLength; //take configured value as > default > } > CleanupThread#run(){ > ... > if (multiNodeDisabled) { > if (asyncSchedulingEnabled) { >maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * > numNodes); > } else { >maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes); > } > } else if (maxQueueLength != configuredMaxQueueLength) { > maxQueueLength = configuredMaxQueueLength; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
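The pseudocode above can be fleshed out into a small Java sketch. The class and field names here are illustrative, not the actual ActivitiesManager code; a {{volatile}} field is one way to address the atomicity question raised in the comment, since the value is written by the cleanup thread and read by other threads.

```java
// Sketch of the auto-adjustment logic from the pseudocode above.
class AppActivitiesQueueLimit {
  private final int configuredMaxQueueLength;
  // volatile: written by the cleanup thread, read by activity-recording threads
  private volatile int maxQueueLength;

  AppActivitiesQueueLimit(int configuredMaxQueueLength) {
    this.configuredMaxQueueLength = configuredMaxQueueLength;
    this.maxQueueLength = configuredMaxQueueLength; // configured value as default
  }

  /** Called periodically from the cleanup thread. */
  void adjust(boolean multiNodeEnabled, boolean asyncSchedulingEnabled,
      int numSchedulingThreads, int numNodes) {
    if (!multiNodeEnabled) {
      if (asyncSchedulingEnabled) {
        // every async scheduling thread iterates all nodes in order
        maxQueueLength = Math.max(configuredMaxQueueLength,
            numSchedulingThreads * numNodes);
      } else {
        // heartbeat-driven: leave some room for out-of-order heartbeats
        maxQueueLength = Math.max(configuredMaxQueueLength,
            (int) (1.2 * numNodes));
      }
    } else if (maxQueueLength != configuredMaxQueueLength) {
      // multi-node placement: fall back to the configured value
      maxQueueLength = configuredMaxQueueLength;
    }
  }

  int getMaxQueueLength() {
    return maxQueueLength;
  }
}
```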
[jira] [Commented] (YARN-9451) AggregatedLogsBlock shows wrong NM http port
[ https://issues.apache.org/jira/browse/YARN-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871431#comment-16871431 ] Weiwei Yang commented on YARN-9451: --- Thanks. Patch LGTM; I guess using the HTTP port makes it easier for people to find out which NM this is. One thing about the log message, though: this wording is quite confusing... {noformat} try the nodemanager at yarn-ats-3:45454 {noformat} shouldn't it be something like {noformat} try to find the container logs in the local directory of nodemanager yarn-ats-3:45454 {noformat} Can we fix that too? Thanks > AggregatedLogsBlock shows wrong NM http port > > > Key: YARN-9451 > URL: https://issues.apache.org/jira/browse/YARN-9451 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > Attachments: Screen Shot 2019-06-20 at 7.49.46 PM.png, > YARN-9451-001.patch, YARN-9451-002.patch > > > AggregatedLogsBlock shows wrong NM http port when aggregated file is not > available. It shows [http://yarn-ats-3:45454|http://yarn-ats-3:45454/] - NM > rpc port instead of http port. > {code:java} > Logs not available for job_1554476304275_0003. Aggregation may not be > complete, Check back later or try the nodemanager at yarn-ats-3:45454 > Or see application log at > http://yarn-ats-3:45454/node/application/application_1554476304275_0003 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition
[ https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869361#comment-16869361 ] Weiwei Yang commented on YARN-9209: --- Thanks [~tarunparimi]. I just committed this to trunk and cherry-picked to branch-3.2. > When nodePartition is not set in Placement Constraints, containers are > allocated only in default partition > -- > > Key: YARN-9209 > URL: https://issues.apache.org/jira/browse/YARN-9209 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9209.001.patch, YARN-9209.002.patch, > YARN-9209.003.patch > > > When application sets a placement constraint without specifying a > nodePartition, the default partition is always chosen as the constraint when > allocating containers. This can be a problem. when an application is > submitted to a queue which has doesn't have enough capacity available on the > default partition. > This is a common scenario when node labels are configured for a particular > queue. The below sample sleeper service cannot get even a single container > allocated when it is submitted to a "labeled_queue", even though enough > capacity is available on the label/partition configured for the queue. Only > the AM container runs. > {code:java}{ > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ] > } > ] > } > } > ] > } > {code} > It runs fine if I specify the node_partition explicitly in the constraints > like below. 
> {code:java} > { > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ], > "node_partitions": [ > "label" > ] > } > ] > } > } > ] > } > {code} > The problem seems to be because only the default partition "" is considered > when node_partition constraint is not specified as seen in below RM log. > {code:java} > 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator > (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367)) > - Successfully added SchedulingRequest to > app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. > nodePartition= > {code} > However, I think it makes more sense to consider "*" or the > {{default-node-label-expression}} of the queue if configured, when no > node_partition is specified in the placement constraint. Since not specifying > any node_partition should ideally mean we don't enforce placement constraints > on any node_partition. However we are enforcing the default partition instead > now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
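The fix direction described at the end of the report (consider "*" or the queue's {{default-node-label-expression}} instead of always pinning to the default partition) can be sketched as a small helper. This is illustrative only, not the actual SingleConstraintAppPlacementAllocator code.

```java
// Hypothetical partition-defaulting helper for placement constraints that
// carry no node_partition: honour an explicit partition first, then the
// queue's default-node-label-expression, and only then the default ("")
// partition. (An alternative discussed in the issue is to resolve to "*".)
class PartitionDefaulting {
  static final String DEFAULT_PARTITION = "";

  /**
   * @param requestedPartition partition from the placement constraint, or
   *        null/empty when the request did not specify one
   * @param queueDefaultLabelExpression the queue's configured
   *        default-node-label-expression, or null when unset
   */
  static String resolvePartition(String requestedPartition,
      String queueDefaultLabelExpression) {
    if (requestedPartition != null && !requestedPartition.isEmpty()) {
      return requestedPartition; // explicitly requested, honour it
    }
    if (queueDefaultLabelExpression != null
        && !queueDefaultLabelExpression.isEmpty()) {
      return queueDefaultLabelExpression; // queue-level default
    }
    return DEFAULT_PARTITION; // nothing configured: default partition
  }
}
```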
[jira] [Updated] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition
[ https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9209: -- Fix Version/s: 3.2.1 > When nodePartition is not set in Placement Constraints, containers are > allocated only in default partition > -- > > Key: YARN-9209 > URL: https://issues.apache.org/jira/browse/YARN-9209 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9209.001.patch, YARN-9209.002.patch, > YARN-9209.003.patch > > > When application sets a placement constraint without specifying a > nodePartition, the default partition is always chosen as the constraint when > allocating containers. This can be a problem. when an application is > submitted to a queue which has doesn't have enough capacity available on the > default partition. > This is a common scenario when node labels are configured for a particular > queue. The below sample sleeper service cannot get even a single container > allocated when it is submitted to a "labeled_queue", even though enough > capacity is available on the label/partition configured for the queue. Only > the AM container runs. > {code:java}{ > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ] > } > ] > } > } > ] > } > {code} > It runs fine if I specify the node_partition explicitly in the constraints > like below. 
> {code:java} > { > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ], > "node_partitions": [ > "label" > ] > } > ] > } > } > ] > } > {code} > The problem seems to be because only the default partition "" is considered > when node_partition constraint is not specified as seen in below RM log. > {code:java} > 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator > (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367)) > - Successfully added SchedulingRequest to > app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. > nodePartition= > {code} > However, I think it makes more sense to consider "*" or the > {{default-node-label-expression}} of the queue if configured, when no > node_partition is specified in the placement constraint. Since not specifying > any node_partition should ideally mean we don't enforce placement constraints > on any node_partition. However we are enforcing the default partition instead > now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition
[ https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869342#comment-16869342 ] Weiwei Yang commented on YARN-9209: --- Agree. +1 for this patch, let's get this issue fixed first. For the documentation, feel free to create a new issue to track. Thanks > When nodePartition is not set in Placement Constraints, containers are > allocated only in default partition > -- > > Key: YARN-9209 > URL: https://issues.apache.org/jira/browse/YARN-9209 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-9209.001.patch, YARN-9209.002.patch, > YARN-9209.003.patch > > > When application sets a placement constraint without specifying a > nodePartition, the default partition is always chosen as the constraint when > allocating containers. This can be a problem. when an application is > submitted to a queue which has doesn't have enough capacity available on the > default partition. > This is a common scenario when node labels are configured for a particular > queue. The below sample sleeper service cannot get even a single container > allocated when it is submitted to a "labeled_queue", even though enough > capacity is available on the label/partition configured for the queue. Only > the AM container runs. > {code:java}{ > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ] > } > ] > } > } > ] > } > {code} > It runs fine if I specify the node_partition explicitly in the constraints > like below. 
> {code:java} > { > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ], > "node_partitions": [ > "label" > ] > } > ] > } > } > ] > } > {code} > The problem seems to be because only the default partition "" is considered > when node_partition constraint is not specified as seen in below RM log. > {code:java} > 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator > (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367)) > - Successfully added SchedulingRequest to > app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. > nodePartition= > {code} > However, I think it makes more sense to consider "*" or the > {{default-node-label-expression}} of the queue if configured, when no > node_partition is specified in the placement constraint. Since not specifying > any node_partition should ideally mean we don't enforce placement constraints > on any node_partition. However we are enforcing the default partition instead > now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition
[ https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869275#comment-16869275 ] Weiwei Yang commented on YARN-9209: --- Hi [~tarunparimi] Sorry, it's been a while. My understanding is this patch does fix the problem, when node-label is enabled, and if PC doesn't have a partition specified, then you add queue's default partition. This is the same logic for a request without PC. Correct? Right now we have some limitations to support ANY partition in PC, like [~leftnoteasy] previously mentioned. I think we need to doc this somewhere to setup correct expectation. BTW, the patch looks good to me. Thanks. > When nodePartition is not set in Placement Constraints, containers are > allocated only in default partition > -- > > Key: YARN-9209 > URL: https://issues.apache.org/jira/browse/YARN-9209 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-9209.001.patch, YARN-9209.002.patch, > YARN-9209.003.patch > > > When application sets a placement constraint without specifying a > nodePartition, the default partition is always chosen as the constraint when > allocating containers. This can be a problem. when an application is > submitted to a queue which has doesn't have enough capacity available on the > default partition. > This is a common scenario when node labels are configured for a particular > queue. The below sample sleeper service cannot get even a single container > allocated when it is submitted to a "labeled_queue", even though enough > capacity is available on the label/partition configured for the queue. Only > the AM container runs. 
> {code:java}{ > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ] > } > ] > } > } > ] > } > {code} > It runs fine if I specify the node_partition explicitly in the constraints > like below. > {code:java} > { > "name": "sleeper-service", > "version": "1.0.0", > "queue": "labeled_queue", > "components": [ > { > "name": "sleeper", > "number_of_containers": 2, > "launch_command": "sleep 9", > "resource": { > "cpus": 1, > "memory": "4096" > }, > "placement_policy": { > "constraints": [ > { > "type": "ANTI_AFFINITY", > "scope": "NODE", > "target_tags": [ > "sleeper" > ], > "node_partitions": [ > "label" > ] > } > ] > } > } > ] > } > {code} > The problem seems to be because only the default partition "" is considered > when node_partition constraint is not specified as seen in below RM log. > {code:java} > 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator > (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367)) > - Successfully added SchedulingRequest to > app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. > nodePartition= > {code} > However, I think it makes more sense to consider "*" or the > {{default-node-label-expression}} of the queue if configured, when no > node_partition is specified in the placement constraint. Since not specifying > any node_partition should ideally mean we don't enforce placement constraints > on any node_partition. However we are enforcing the default partition instead > now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed
[ https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869268#comment-16869268 ] Weiwei Yang commented on YARN-9634: --- Hi [~zhuqi] Thanks. Are you suggesting letting the parameter {{yarn.nodemanager.remote-app-log-dir}} support multiple dirs, and using e.g. a round-robin policy to select among them? If so, would this break anything, such as the history server, timeline server, or log CLI? I assume some of these components need to read logs from this location. Please elaborate, thanks. > Make yarn submit dir and log aggregation dir more evenly distributed > > > Key: YARN-9634 > URL: https://issues.apache.org/jira/browse/YARN-9634 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > > When the cluster size is large, the dir which user submits the job, and the > dir which container log aggregate, and other information will fill the HDFS > directory, because the HDFS directory has a default storage limit. In > response to this situation, we can change these dirs more distributed, with > some policy to choose, such as hash policy and round robin policy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
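To make the proposal concrete, here is a hypothetical sketch of picking one of several base dirs per application with a hash or round-robin policy. None of these class or method names exist in YARN today, and a real change would also need the readers mentioned in the comment (history server, timeline server, log CLI) to locate the chosen dir.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative dir-selection policies so no single HDFS directory fills up.
class DirDistributionPolicy {
  private final List<String> baseDirs;
  private final AtomicLong counter = new AtomicLong();

  DirDistributionPolicy(List<String> baseDirs) {
    this.baseDirs = baseDirs;
  }

  /** Hash policy: a stable mapping that any reader can recompute from the
   *  application id alone. */
  String selectByHash(String applicationId) {
    int idx = Math.floorMod(applicationId.hashCode(), baseDirs.size());
    return baseDirs.get(idx);
  }

  /** Round-robin policy: an even spread, but the chosen dir must then be
   *  recorded somewhere readers can find it. */
  String selectRoundRobin() {
    int idx = (int) (counter.getAndIncrement() % baseDirs.size());
    return baseDirs.get(idx);
  }
}
```

This is one reason a hash policy may be the easier answer to the compatibility question above: readers recompute the directory instead of looking up a persisted mapping.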
[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed
[ https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868547#comment-16868547 ] Weiwei Yang commented on YARN-9634: --- Hi [~zhuqi] what do you mean by the default storage limit, is that the space quota? Can you give an example of how to distribute these dirs, and why that helps? Thanks. > Make yarn submit dir and log aggregation dir more evenly distributed > > > Key: YARN-9634 > URL: https://issues.apache.org/jira/browse/YARN-9634 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > > When the cluster size is large, the dir which user submits the job, and the > dir which container log aggregate, and other information will fill the HDFS > directory, because the HDFS directory has a default storage limit. In > response to this situation, we can change these dirs more distributed, with > some policy to choose, such as hash policy and round robin policy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865465#comment-16865465 ] Weiwei Yang commented on YARN-9621: --- Thanks [~Prabhu Joseph], [~pbacsko]. the fix LGTM, committing this shortly. > FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint > on branch-3.1 > --- > > Key: YARN-9621 > URL: https://issues.apache.org/jira/browse/YARN-9621 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell, test >Affects Versions: 3.1.2 >Reporter: Peter Bacsko >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9621-branch-3.1.001.patch, > YARN-9621-branch-3.1.002.patch, YARN-9621-branch-3.1.003.patch, > YARN-9621-branch-3.1.004.patch, YARN-9621-branch-3.1.005.patch > > > Testcase > {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} > seems to constantly fail on branch 3.1. I believe it was introduced by > YARN-9253. > {noformat} > testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager) > Time elapsed: 24.636 s <<< FAILURE! 
> java.lang.AssertionError: expected:<1> but was:<2> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862697#comment-16862697 ] Weiwei Yang commented on YARN-9567: --- Hi [~Tao Yang] I will take a look at this later this week. Will share feedback then. Thank you. > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: image-2019-06-04-17-29-29-368.png, > image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, > no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API
[ https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862658#comment-16862658 ] Weiwei Yang commented on YARN-9578: --- Thanks [~Tao Yang] for the updates, it looks good to me now. +1 for v6 patch, I am going to commit it shortly. Thank you. > Add limit/actions/summarize options for app activities REST API > --- > > Key: YARN-9578 > URL: https://issues.apache.org/jira/browse/YARN-9578 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9578.001.patch, YARN-9578.002.patch, > YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch, > YARN-9578.006.patch > > > Currently all completed activities of specified application in cache will be > returned for application activities REST API. Most results may be redundant > in some scenarios which only need a few latest results, for example, perhaps > only one result is needed to be shown on UI for debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9578) Add limit/actions/summarize options for app activities REST API
[ https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9578: -- Hadoop Flags: Incompatible change > Add limit/actions/summarize options for app activities REST API > --- > > Key: YARN-9578 > URL: https://issues.apache.org/jira/browse/YARN-9578 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9578.001.patch, YARN-9578.002.patch, > YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch, > YARN-9578.006.patch > > > Currently all completed activities of specified application in cache will be > returned for application activities REST API. Most results may be redundant > in some scenarios which only need a few latest results, for example, perhaps > only one result is needed to be shown on UI for debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API
[ https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861949#comment-16861949 ] Weiwei Yang commented on YARN-9578: --- Thanks [~Tao Yang]. I see, then it's fine, given that subList won't have much overhead. For the REST API, please update it to the simpler form we discussed; I can mark this as an incompatible change. That should be OK, and we should do it sooner rather than later. Thanks > Add limit/actions/summarize options for app activities REST API > --- > > Key: YARN-9578 > URL: https://issues.apache.org/jira/browse/YARN-9578 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9578.001.patch, YARN-9578.002.patch, > YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch > > > Currently all completed activities of specified application in cache will be > returned for application activities REST API. Most results may be redundant > in some scenarios which only need a few latest results, for example, perhaps > only one result is needed to be shown on UI for debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
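The {{limit}} behaviour and the subList point from the comment above can be illustrated with a minimal sketch. This is a hypothetical helper, not the actual patch code, and it assumes the cached activities are ordered oldest-first.

```java
import java.util.List;

// Return only the latest N completed activities from the cached list.
// subList returns a view backed by the original list, which is why the
// overhead mentioned in the comment is small.
class ActivitiesLimiter {
  /** A non-positive limit means "no limit" in this sketch. */
  static <T> List<T> latest(List<T> activities, int limit) {
    if (limit <= 0 || limit >= activities.size()) {
      return activities; // no limiting needed
    }
    return activities.subList(activities.size() - limit, activities.size());
  }
}
```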
[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled
[ https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860157#comment-16860157 ] Weiwei Yang commented on YARN-9598: --- Thanks for bringing this up and for the discussions. It looks like the discussion has drifted a bit, so let's make sure we understand the problem we want to resolve here. If I understand correctly, [~jutia] was observing that re-reservations are made on a single node because the policy always returns the same order. Actually, this is not the only issue: this policy may also create hot-spot nodes when multiple threads place allocations on the same ordered nodes. I think we need to improve the policy; one possible solution, as I previously commented, is to shuffle nodes within each score range. BTW, [~jutia], are you already using this policy in your cluster? The issue [~Tao Yang] raised is also valid: when re-reservations are made by a lot of small asks across lots of nodes (when the cluster is busy), big requests can starve. This issue should be reproducible with SLS. I took a quick look at the patch [~Tao Yang] uploaded, but I also have concerns about disabling re-reservation. How can we make sure a big container request does not get starved in that case? Maybe a way to improve this is to swap reserved containers on NMs, e.g. if a container is already reserved somewhere else, we can swap that spot with a bigger container that has no reservation yet. Just a random thought. 
> Make reservation work well when multi-node enabled > -- > > Key: YARN-9598 > URL: https://issues.apache.org/jira/browse/YARN-9598 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, > image-2019-06-10-11-37-44-975.png > > > This issue is to solve problems about reservation when multi-node enabled: > # As discussed in YARN-9576, re-reservation proposal may be always generated > on the same node and break the scheduling for this app and later apps. I > think re-reservation in unnecessary and we can replace it with > LOCALITY_SKIPPED to let scheduler have a chance to look up follow candidates > for this app when multi-node enabled. > # Scheduler iterates all nodes and try to allocate for reserved container in > LeafQueue#allocateFromReservedContainer. Here there are two problems: > ** The node of reserved container should be taken as candidates instead of > all nodes when calling FiCaSchedulerApp#assignContainers, otherwise later > scheduler may generate a reservation-fulfilled proposal on another node, > which will always be rejected in FiCaScheduler#commonCheckContainerAllocation. > ** Assignment returned by FiCaSchedulerApp#assignContainers could never be > null even if it's just skipped, it will break the normal scheduling process > for this leaf queue because of the if clause in LeafQueue#assignContainers: > "if (null != assignment) \{ return assignment;}" > # Nodes which have been reserved should be skipped when iterating candidates > in RegularContainerAllocator#allocate, otherwise scheduler may generate > allocation or reservation proposal on these node which will always be > rejected in FiCaScheduler#commonCheckContainerAllocation. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
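The "shuffle nodes per score-range" idea mentioned in the comment above could look roughly like the following. The class name and scoring interface are made up for illustration and are not the actual multi-node sorting policy API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Bucket candidate nodes into coarse score bands, keep the bands in score
// order, but shuffle within each band so concurrent scheduling threads do
// not all hammer the same top-scored node.
class ScoreBandShuffler {
  /**
   * @param nodeScores node name -> score (higher is better)
   * @param bandWidth  width of one score band, e.g. 10.0
   */
  static List<String> order(Map<String, Double> nodeScores,
      double bandWidth, Random random) {
    // group nodes by floor(score / bandWidth), best band first
    TreeMap<Long, List<String>> bands =
        new TreeMap<>(Collections.reverseOrder());
    for (Map.Entry<String, Double> e : nodeScores.entrySet()) {
      long band = (long) Math.floor(e.getValue() / bandWidth);
      bands.computeIfAbsent(band, b -> new ArrayList<>()).add(e.getKey());
    }
    List<String> result = new ArrayList<>();
    for (List<String> band : bands.values()) {
      Collections.shuffle(band, random); // break ties randomly within a band
      result.addAll(band);
    }
    return result;
  }
}
```

Nodes with similar scores still sort ahead of clearly worse ones, but the in-band shuffle avoids the single-hot-node pattern that drives repeated re-reservations on one node.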
[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled
[ https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857545#comment-16857545 ] Weiwei Yang commented on YARN-9598: --- Ping [~jutia], could you help take a look too? Thanks > Make reservation work well when multi-node enabled > -- > > Key: YARN-9598 > URL: https://issues.apache.org/jira/browse/YARN-9598 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9598.001.patch > > > This issue is to solve problems with reservation when multi-node is enabled: > # As discussed in YARN-9576, a re-reservation proposal may always be generated > on the same node and break the scheduling for this app and later apps. I > think re-reservation is unnecessary and we can replace it with > LOCALITY_SKIPPED to let the scheduler have a chance to look up subsequent candidates > for this app when multi-node is enabled. > # The scheduler iterates over all nodes and tries to allocate for the reserved container in > LeafQueue#allocateFromReservedContainer. There are two problems here: > ** The node of the reserved container should be taken as the candidate instead of > all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the scheduler > may later generate a reservation-fulfilled proposal on another node, > which will always be rejected in FiCaScheduler#commonCheckContainerAllocation. > ** The assignment returned by FiCaSchedulerApp#assignContainers can never be > null even if the allocation is just skipped, which breaks the normal scheduling process > for this leaf queue because of the if clause in LeafQueue#assignContainers: > "if (null != assignment) \{ return assignment;}" > # Nodes which have been reserved should be skipped when iterating candidates > in RegularContainerAllocator#allocate, otherwise the scheduler may generate an > allocation or reservation proposal on these nodes, which will always be > rejected in FiCaScheduler#commonCheckContainerAllocation. 
[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API
[ https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857544#comment-16857544 ] Weiwei Yang commented on YARN-9578: --- Hi [~Tao Yang] For the limit option, I was referring to this code snippet {code} allocations = curAllocations.stream().map(e -> e.filterAllocationAttempts(requestPriorities, allocationRequestIds)) .filter(e -> !e.getAllocationAttempts().isEmpty()) .collect(Collectors.toList()); {code} Can we add a limit here instead of using a subList later? Thanks > Add limit/actions/summarize options for app activities REST API > --- > > Key: YARN-9578 > URL: https://issues.apache.org/jira/browse/YARN-9578 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9578.001.patch, YARN-9578.002.patch, > YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch > > > Currently all completed activities of the specified application in the cache will be > returned by the application activities REST API. Most results may be redundant > in scenarios which only need a few of the latest results; for example, perhaps > only one result needs to be shown on the UI for debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
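The change suggested above, applying the limit inside the stream pipeline rather than trimming with a subList afterwards, can be sketched with plain integers standing in for the allocation objects; the class name and the even-number filter below are hypothetical illustrations, not ActivitiesManager code.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LimitInStream {
    // limit() short-circuits the pipeline once enough elements pass the
    // filter, so the full result list is never materialized and then cut
    // down with subList.
    static List<Integer> firstMatching(Stream<Integer> source, int limit) {
        return source
                .filter(e -> e % 2 == 0) // stands in for the non-empty-attempts filter
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Works even on an unbounded stream, which a collect-then-subList
        // approach could not handle.
        System.out.println(firstMatching(Stream.iterate(0, i -> i + 1), 3)); // prints [0, 2, 4]
    }
}
```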
[jira] [Commented] (YARN-9590) Correct incompatible, incomplete and redundant activities
[ https://issues.apache.org/jira/browse/YARN-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857541#comment-16857541 ] Weiwei Yang commented on YARN-9590: --- Sure, makes sense to me. LGTM, +1. I'll commit shortly. > Correct incompatible, incomplete and redundant activities > - > > Key: YARN-9590 > URL: https://issues.apache.org/jira/browse/YARN-9590 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9590.001.patch, YARN-9590.002.patch, > YARN-9590.003.patch > > > Currently some branches in the scheduling process may generate incomplete or > duplicate activities; we should fix them to keep activities clean. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9590) Correct incompatible, incomplete and redundant activities
[ https://issues.apache.org/jira/browse/YARN-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857513#comment-16857513 ] Weiwei Yang commented on YARN-9590: --- Hi [~Tao Yang] Thanks, good cleanup and fixes. Just one thing I'm curious about. Currently, ActivitiesManager maintains a map of queues, {{completedAppAllocations}} (a {{ConcurrentMap}}), and each queue is a {{ConcurrentLinkedQueue}}. But since what is wanted here is a limited-size queue, why don't you use, e.g., Guava's EvictingQueue? > Correct incompatible, incomplete and redundant activities > - > > Key: YARN-9590 > URL: https://issues.apache.org/jira/browse/YARN-9590 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9590.001.patch, YARN-9590.002.patch, > YARN-9590.003.patch > > > Currently some branches in the scheduling process may generate incomplete or > duplicate activities; we should fix them to keep activities clean. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
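Guava's EvictingQueue, suggested above, is a fixed-capacity queue that drops its oldest element when full. A minimal JDK-only stand-in with the same eviction semantics (a sketch for illustration, not Hadoop or Guava code) looks like:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedQueue<E> {
    private final Deque<E> deque = new ArrayDeque<>();
    private final int capacity;

    public BoundedQueue(int capacity) {
        this.capacity = capacity;
    }

    // Adding beyond capacity evicts the head (oldest element), which is
    // the size limit ActivitiesManager otherwise has to enforce by hand.
    public void add(E e) {
        if (deque.size() == capacity) {
            deque.removeFirst();
        }
        deque.addLast(e);
    }

    public int size() { return deque.size(); }

    public E peekOldest() { return deque.peekFirst(); }
}
```

With capacity 2, adding 1, 2, 3 leaves the queue holding 2 and 3. Note that, like EvictingQueue, this sketch is not thread-safe on its own and would need external synchronization to replace a ConcurrentLinkedQueue.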
[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API
[ https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857495#comment-16857495 ] Weiwei Yang commented on YARN-9578: --- Hi [~Tao Yang] Thanks for the patch. A few comments here: 1. ActivitiesManager#getAppActivitiesInfo Since you are supporting a limited number of records here, instead of cutting off from a full list, can it be done while generating the allocations? Is there a default value for this limit? 2. RMWebServices#getAppActivities I am a bit confused by the query parameter ACTIONS; it looks like it is always required, but that form is not a very RESTful pattern. Instead, I would expect the query URL to look like {code:java} /scheduler/app-activities/app-id/get?xxx /scheduler/app-activities/app-id/update?xxx {code} 3. RMConstants#AppActivitiesRequiredAction Can we rename "UPDATE" to "REFRESH"? Thanks > Add limit/actions/summarize options for app activities REST API > --- > > Key: YARN-9578 > URL: https://issues.apache.org/jira/browse/YARN-9578 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9578.001.patch, YARN-9578.002.patch, > YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch > > > Currently all completed activities of the specified application in the cache will be > returned by the application activities REST API. Most results may be redundant > in scenarios which only need a few of the latest results; for example, perhaps > only one result needs to be shown on the UI for debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9600) Support self-adaption width for columns of containers table on app attempt page
[ https://issues.apache.org/jira/browse/YARN-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856371#comment-16856371 ] Weiwei Yang commented on YARN-9600: --- Thanks [~akhilpb] for the additional review, I'll help to commit this shortly. > Support self-adaption width for columns of containers table on app attempt > page > --- > > Key: YARN-9600 > URL: https://issues.apache.org/jira/browse/YARN-9600 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9600.001.patch, image-2019-06-04-16-45-49-359.png, > image-2019-06-04-16-55-18-899.png > > > When there are outstanding requests showing on the app attempt page, the page > will be automatically stretched horizontally; after that, the columns of the > containers table can't fill the table, leaving two blank spaces at the leftmost > and the rightmost of this table, as the following picture shows: > !image-2019-06-04-16-45-49-359.png|width=647,height=231! > We can add a relative width style (width:100%) for the containers table to make the > columns self-adapting. > After doing that, the containers table shows as follows: > !image-2019-06-04-16-55-18-899.png|width=645,height=229! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9600) Support self-adaption width for columns of containers table on app attempt page
[ https://issues.apache.org/jira/browse/YARN-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855649#comment-16855649 ] Weiwei Yang commented on YARN-9600: --- Ping [~akhilpb], would you please help review this patch? Thanks > Support self-adaption width for columns of containers table on app attempt > page > --- > > Key: YARN-9600 > URL: https://issues.apache.org/jira/browse/YARN-9600 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9600.001.patch, image-2019-06-04-16-45-49-359.png, > image-2019-06-04-16-55-18-899.png > > > When there are outstanding requests showing on the app attempt page, the page > will be automatically stretched horizontally; after that, the columns of the > containers table can't fill the table, leaving two blank spaces at the leftmost > and the rightmost of this table, as the following picture shows: > !image-2019-06-04-16-45-49-359.png|width=647,height=231! > We can add a relative width style (width:100%) for the containers table to make the > columns self-adapting. > After doing that, the containers table shows as follows: > !image-2019-06-04-16-55-18-899.png|width=645,height=229! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers
[ https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855277#comment-16855277 ] Weiwei Yang commented on YARN-9580: --- Hi [~Tao Yang] There seem to be issues in the patch for branch-3.2; could you please take a look? > Fulfilled reservation information in assignment is lost when transferring in > ParentQueue#assignContainers > - > > Key: YARN-9580 > URL: https://issues.apache.org/jira/browse/YARN-9580 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9580.001.patch, YARN-9580.branch-3.2.001.patch > > > When transferring an assignment from a child queue to a parent queue, fulfilled > reservation information, including fulfilledReservation and > fulfilledReservedContainer, in the assignment is lost. > When multi-node is enabled, this loss can cause a problem where an allocation > proposal is generated but can't be accepted because there is a check for > fulfilled reservation information in > FiCaSchedulerApp#commonCheckContainerAllocation; this endless loop will > always be there and the resources of the node can't be used anymore. > In HB-driven scheduling mode, a fulfilled reservation can be allocated via > another calling stack: CapacityScheduler#allocateContainersToNode --> > CapacityScheduler#allocateContainerOnSingleNode --> > CapacityScheduler#allocateFromReservedContainer; in this way the assignment can > be generated by the leaf queue and directly submitted, which I think is why we > hardly found this problem before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
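The loss described above follows a common bug pattern: an assignment object rebuilt at the parent level without carrying over the reservation-fulfilled fields. A stripped-down sketch of that pattern (the field names mirror the JIRA text, but this is illustrative code, not the actual CapacityScheduler implementation):

```java
public class AssignmentTransferDemo {
    // Minimal stand-ins for the reservation-related state in an assignment.
    static class Assignment {
        Object fulfilledReservation;
        Object fulfilledReservedContainer;
    }

    // Buggy transfer: builds a fresh assignment and silently drops the
    // reservation fields, so a later commonCheckContainerAllocation-style
    // check would reject the proposal every time.
    static Assignment transferLossy(Assignment child) {
        return new Assignment();
    }

    // Fixed transfer: copy the fulfilled-reservation information along.
    static Assignment transferFixed(Assignment child) {
        Assignment parent = new Assignment();
        parent.fulfilledReservation = child.fulfilledReservation;
        parent.fulfilledReservedContainer = child.fulfilledReservedContainer;
        return parent;
    }

    public static void main(String[] args) {
        Assignment child = new Assignment();
        child.fulfilledReservation = new Object();
        System.out.println(transferLossy(child).fulfilledReservation == null); // prints true
        System.out.println(transferFixed(child).fulfilledReservation != null); // prints true
    }
}
```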
[jira] [Reopened] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers
[ https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang reopened YARN-9580: --- > Fulfilled reservation information in assignment is lost when transferring in > ParentQueue#assignContainers > - > > Key: YARN-9580 > URL: https://issues.apache.org/jira/browse/YARN-9580 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9580.001.patch, YARN-9580.branch-3.2.001.patch > > > When transferring an assignment from a child queue to a parent queue, fulfilled > reservation information, including fulfilledReservation and > fulfilledReservedContainer, in the assignment is lost. > When multi-node is enabled, this loss can cause a problem where an allocation > proposal is generated but can't be accepted because there is a check for > fulfilled reservation information in > FiCaSchedulerApp#commonCheckContainerAllocation; this endless loop will > always be there and the resources of the node can't be used anymore. > In HB-driven scheduling mode, a fulfilled reservation can be allocated via > another calling stack: CapacityScheduler#allocateContainersToNode --> > CapacityScheduler#allocateContainerOnSingleNode --> > CapacityScheduler#allocateFromReservedContainer; in this way the assignment can > be generated by the leaf queue and directly submitted, which I think is why we > hardly found this problem before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers
[ https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854701#comment-16854701 ] Weiwei Yang commented on YARN-9580: --- Committed to trunk. [~Tao Yang], can you please provide a patch for branch-3.2 too? > Fulfilled reservation information in assignment is lost when transferring in > ParentQueue#assignContainers > - > > Key: YARN-9580 > URL: https://issues.apache.org/jira/browse/YARN-9580 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9580.001.patch > > > When transferring an assignment from a child queue to a parent queue, fulfilled > reservation information, including fulfilledReservation and > fulfilledReservedContainer, in the assignment is lost. > When multi-node is enabled, this loss can cause a problem where an allocation > proposal is generated but can't be accepted because there is a check for > fulfilled reservation information in > FiCaSchedulerApp#commonCheckContainerAllocation; this endless loop will > always be there and the resources of the node can't be used anymore. > In HB-driven scheduling mode, a fulfilled reservation can be allocated via > another calling stack: CapacityScheduler#allocateContainersToNode --> > CapacityScheduler#allocateContainerOnSingleNode --> > CapacityScheduler#allocateFromReservedContainer; in this way the assignment can > be generated by the leaf queue and directly submitted, which I think is why we > hardly found this problem before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers
[ https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854697#comment-16854697 ] Weiwei Yang commented on YARN-9580: --- [~Tao Yang], thanks for the patch, it makes sense to me. +1. > Fulfilled reservation information in assignment is lost when transferring in > ParentQueue#assignContainers > - > > Key: YARN-9580 > URL: https://issues.apache.org/jira/browse/YARN-9580 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9580.001.patch > > > When transferring an assignment from a child queue to a parent queue, fulfilled > reservation information, including fulfilledReservation and > fulfilledReservedContainer, in the assignment is lost. > When multi-node is enabled, this loss can cause a problem where an allocation > proposal is generated but can't be accepted because there is a check for > fulfilled reservation information in > FiCaSchedulerApp#commonCheckContainerAllocation; this endless loop will > always be there and the resources of the node can't be used anymore. > In HB-driven scheduling mode, a fulfilled reservation can be allocated via > another calling stack: CapacityScheduler#allocateContainersToNode --> > CapacityScheduler#allocateContainerOnSingleNode --> > CapacityScheduler#allocateFromReservedContainer; in this way the assignment can > be generated by the leaf queue and directly submitted, which I think is why we > hardly found this problem before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854265#comment-16854265 ] Weiwei Yang commented on YARN-9507: --- Committed to trunk, cherry-picked to branch-3.2 and branch-3.1. > Fix NPE in NodeManager#serviceStop on startup failure > - > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
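The NPE above happens because serviceStop runs against fields that were never initialized when serviceInit failed part-way. The usual defensive pattern, sketched below with hypothetical sub-service names rather than the real NodeManager fields, is to null-check each component before stopping it:

```java
public class SafeStopDemo {
    interface Service { void stop(); }

    // Fields may remain null if initialization fails before reaching them.
    Service dispatcher;
    Service stateStore;

    public int stopServices() {
        int stopped = 0;
        // Guard each field: stop() on a half-initialized component
        // must not throw NullPointerException.
        if (dispatcher != null) { dispatcher.stop(); stopped++; }
        if (stateStore != null) { stateStore.stop(); stopped++; }
        return stopped;
    }

    public static void main(String[] args) {
        SafeStopDemo demo = new SafeStopDemo();
        demo.stateStore = () -> { };  // only one sub-service was initialized
        System.out.println(demo.stopServices()); // prints 1, with no NPE
    }
}
```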
[jira] [Updated] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9507: -- Fix Version/s: 3.1.3 > Fix NPE in NodeManager#serviceStop on startup failure > - > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-9507: -- Fix Version/s: 3.2.1 > Fix NPE in NodeManager#serviceStop on startup failure > - > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854246#comment-16854246 ] Weiwei Yang commented on YARN-9507: --- LGTM, +1. Committing shortly. > Fix NPE in NodeManager#serviceStop on startup failure > - > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852032#comment-16852032 ] Weiwei Yang commented on YARN-9538: --- Hi [~Tao Yang] Thanks for the updates in doc template, it looks much better now. I've made some modifications based on your v2 version. Could you please take a look? Thanks > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8693) Add signalToContainer REST API for RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850619#comment-16850619 ] Weiwei Yang edited comment on YARN-8693 at 5/29/19 8:34 AM: Hi [~Tao Yang] The latest patch seems good to me, and the warnings in the latest jenkins report seem to be irrelevant, so I'll commit this shortly. I assume you have generated the doc and verified the format locally; if you haven't done that, please make sure this is done properly. Thanks. was (Author: cheersyang): Hi [~Tao Yang] The latest patch seems good to me, and the warnings in the latest jenkins report seem to be irrelevant, so I'll commit this shortly. Thanks. > Add signalToContainer REST API for RMWebServices > > > Key: YARN-8693 > URL: https://issues.apache.org/jira/browse/YARN-8693 > Project: Hadoop YARN > Issue Type: Improvement > Components: restapi >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8693.001.patch, YARN-8693.002.patch, > YARN-8693.003.patch > > > Currently YARN has an RPC command, "yarn container -signal <container ID> [signal command]", to signal > OUTPUT_THREAD_DUMP/GRACEFUL_SHUTDOWN/FORCEFUL_SHUTDOWN commands to a container. > That is not enough, and we need to add a signalToContainer REST API for better > management by cluster administrators or management systems. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org