[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2019-01-16 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743796#comment-16743796
 ] 

Tao Yang commented on YARN-9050:


Hi, [~cheersyang].
As a footnote, this change will record only one allocation activity, instead of 
multiple allocation activities, when allocating multiple containers per node 
heartbeat. That's because the start/finish points will be moved from 
CapacityScheduler#nodeUpdate to CapacityScheduler#allocateContainersToNode. 
Although this changes the behavior compared with the past, I think it's a 
better solution for all scenarios, and we can describe this change in the 
RM REST API documentation. Right?

> Usability improvements for scheduler activities
> ---
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements for scheduler activities based on 
> YARN 3.1 in our cluster, as follows:
> 1. Not available for multi-thread asynchronous scheduling. App and node 
> activities may be confused when multiple scheduling threads record activities 
> of different allocation processes in the same variables, such as 
> appsAllocation and recordingNodesAllocation in ActivitiesManager. I think 
> these variables should be thread-local to keep activities separate among 
> multiple threads.
> 2. Incomplete activities for the multi-node lookup mechanism, since 
> ActivitiesLogger will skip recording through {{if (node == null || 
> activitiesManager == null) }} when node is null, which represents that this 
> allocation is for multiple nodes. We need to support recording activities 
> for the multi-node lookup mechanism.
> 3. Current app activities cannot meet the requirements of diagnostics. For 
> example, we can know that a node doesn't match a request but it is hard to 
> know why; especially when using placement constraints, it's difficult to 
> make a detailed diagnosis manually. So I propose to improve the diagnoses of 
> activities: add a diagnosis for the placement constraints check, update the 
> insufficient resource diagnosis with detailed info (like 'insufficient 
> resource names:[memory-mb]'), and so on.
> 4. Add more useful fields for app activities. In some scenarios we need to 
> distinguish different requests but can't locate requests based on app 
> activities info; some other fields, such as allocation tags, can help to 
> filter what we want. We have added containerPriority, allocationRequestId 
> and allocationTags fields in AppAllocation.
> 5. Filter app activities by key fields. Sometimes the results of app 
> activities are massive and it's hard to find what we want. We have supported 
> filtering by allocation-tags to meet requirements from some apps; moreover, 
> we can take container-priority and allocation-request-id as candidates if 
> necessary.
> 6. Aggregate app activities by diagnoses. For a single allocation process, 
> activities can still be massive in a large cluster. We frequently want to 
> know why a request can't be allocated in the cluster, and it's hard to check 
> every node manually in a large cluster, so aggregation of app activities by 
> diagnoses is necessary. We have added a groupingType parameter to the 
> app-activities REST API for this, which supports grouping by diagnostics; 
> an example looks like this:
>  !image-2018-11-23-16-46-38-138.png! 
> I think we can have a discussion about these points; useful improvements 
> that are accepted will be added to the patch. Thanks.
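
Point 1 of the description can be illustrated with a minimal sketch, assuming that per-thread recording state is enough to keep allocation processes apart. The class ThreadLocalActivitiesSketch and all of its methods are hypothetical stand-ins for illustration; they are not the real ActivitiesManager API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the recording state kept in ActivitiesManager.
class ThreadLocalActivitiesSketch {
  // One list per scheduling thread, so concurrent allocation processes
  // never append to the same collection.
  private final ThreadLocal<List<String>> recording =
      ThreadLocal.withInitial(ArrayList::new);

  // Records an activity for the allocation process running on this thread.
  void record(String activity) {
    recording.get().add(activity);
  }

  // Returns the activities recorded by the calling thread and resets its state.
  List<String> finish() {
    List<String> done = new ArrayList<>(recording.get());
    recording.remove();
    return done;
  }

  // Demo helper: records one activity on a separate thread and returns
  // what that thread saw when it finished.
  static List<String> recordOnNewThread(ThreadLocalActivitiesSketch m,
                                        String activity) {
    AtomicReference<List<String>> seen = new AtomicReference<>();
    Thread t = new Thread(() -> {
      m.record(activity);
      seen.set(m.finish());
    });
    t.start();
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return seen.get();
  }
}
```

In this sketch a thread that calls finish() only ever sees what it recorded itself, which is the separation the description asks for.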



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2019-01-16 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743735#comment-16743735
 ] 

Tao Yang commented on YARN-9050:


Thanks [~cheersyang] for the review.
{quote}
Will this add a lot of if-else? 
{quote}
No, I think we just need to modify some if-else conditions (e.g. the 
"node != null" constraint should be removed from the if condition when 
recording activities in ActivitiesLogger) instead of adding if-else blocks. 
{quote}
Is it possible to make it work transparently for both HB-based lookup and 
multi-node lookup?
{quote}
Sure, we can take CapacityScheduler#allocateContainersToNode as the unified 
entrance and exit, as described in the second change to support scheduler 
activities: place the start/finish points of scheduler activities before and 
after the allocation based on a single node (input node is a real node) or 
multiple nodes (input node is ActivitiesManager#MULTI_NODES_AGENT) in 
CapacityScheduler#allocateContainersToNode instead of 
CapacityScheduler#nodeUpdate, to expand the applicable scenarios via a unified 
entrance and exit. 
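
The unified entrance and exit described above can be sketched roughly as follows. This is a simplified model under stated assumptions: node ids are plain strings, the placeholder value and the class UnifiedRecordingSketch are hypothetical, and mapping a null node to the placeholder is only one possible way to drop the "node != null" condition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the unified entrance/exit; node ids are
// plain strings here, not real scheduler node objects.
class UnifiedRecordingSketch {
  // Placeholder id representing "multiple candidate nodes" (value is assumed).
  static final String MULTI_NODES_AGENT = "MULTI_NODES_AGENT";

  private final List<String> activities = new ArrayList<>();

  // Maps a missing node (multi-node lookup) to the placeholder so that the
  // recording code no longer needs a "node != null" condition.
  static String recordingNode(String nodeId) {
    return nodeId != null ? nodeId : MULTI_NODES_AGENT;
  }

  // Stand-in for the allocation entry point: start and finish are placed
  // around the allocation, whichever lookup mode supplied the node.
  List<String> allocate(String nodeId) {
    String node = recordingNode(nodeId);
    activities.add("start " + node);
    activities.add("allocate on " + node);
    activities.add("finish " + node);
    List<String> result = new ArrayList<>(activities);
    activities.clear();
    return result;
  }
}
```

Both allocate("node-1") and allocate(null) go through the same start/finish pair, so no extra if-else branch is needed per lookup mode.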




[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2019-01-15 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743634#comment-16743634
 ] 

Weiwei Yang commented on YARN-9050:
---

Hi [~Tao Yang]

I just went through the doc, thanks for providing a lot of details : ).

It makes sense to fix the code so that it can run in both modes.

One question though
{quote}Add a fake node named MULTI_NODES_AGENT in ActivitiesManager to 
represent multiple nodes and relax restrictions(only for non-null node now) on 
scheduler activities in ActivitiesLogger to support for multi-node lookup 
mechanism.
{quote}
Will this add a lot of if-else? Is it possible to make it work transparently 
for both HB-based lookup and multi-node lookup?




[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2019-01-09 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738219#comment-16738219
 ] 

Tao Yang commented on YARN-9050:


Hi [~leftnoteasy], [~cheersyang] 

The design doc is attached 
[here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5],
 please help review it when you have time. Thanks!




[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2018-12-05 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710263#comment-16710263
 ] 

Weiwei Yang commented on YARN-9050:
---

Hi [~Tao Yang], thanks for creating this; these improvements will certainly be 
very useful, and so will the documentation :D, thanks!




[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2018-12-03 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707553#comment-16707553
 ] 

Wangda Tan commented on YARN-9050:
--

[~Tao Yang], makes sense to me. Once you have figured out the details, I can 
help with reviews, etc.




[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2018-12-03 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707010#comment-16707010
 ] 

Tao Yang commented on YARN-9050:


Thanks [~leftnoteasy] for your support.
I would like to work on this later and initially want to split these 
improvements into 3 sub-tasks:
(1) Support multi-thread asynchronous scheduling for scheduler activities 
(includes point 1)
(2) Improve diagnostics / fields / scenarios for scheduler activities (includes 
points 2, 3, 4)
(3) Support filtering and aggregation for the scheduler activities REST API 
(includes points 5, 6)
I will pay attention to the overhead and avoid bringing in unnecessary 
complexity. In our clusters, there are several external systems which query 
data through the YARN REST API and have no usage scenarios for the web UI or 
CLI, but I would like to support the web UI or CLI if they are in demand.
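
As a rough illustration of sub-task (3), filtering by allocation tag and then grouping by diagnostic could look like the sketch below. The class ActivityQuerySketch, the NodeActivity type and its fields are hypothetical and do not reflect the actual REST API classes; diagnostics are treated as arbitrary strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ActivityQuerySketch {
  // Hypothetical record of one node-level activity of an app.
  static final class NodeActivity {
    final String nodeId;
    final String diagnostic;
    final Set<String> allocationTags;

    NodeActivity(String nodeId, String diagnostic, Set<String> allocationTags) {
      this.nodeId = nodeId;
      this.diagnostic = diagnostic;
      this.allocationTags = allocationTags;
    }
  }

  // Keeps only activities whose request carries the given allocation tag.
  static List<NodeActivity> filterByTag(List<NodeActivity> all, String tag) {
    List<NodeActivity> kept = new ArrayList<>();
    for (NodeActivity a : all) {
      if (a.allocationTags.contains(tag)) {
        kept.add(a);
      }
    }
    return kept;
  }

  // Groups node ids by diagnostic, preserving first-seen order of diagnostics.
  static Map<String, List<String>> groupByDiagnostic(List<NodeActivity> activities) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (NodeActivity a : activities) {
      groups.computeIfAbsent(a.diagnostic, k -> new ArrayList<>()).add(a.nodeId);
    }
    return groups;
  }
}
```

With grouping, a caller reads one entry per distinct diagnostic instead of one entry per node, which is the reduction wanted for large clusters.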




[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2018-11-29 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703553#comment-16703553
 ] 

Wangda Tan commented on YARN-9050:
--

[~Tao Yang], thanks for filing the JIRA.

All the issues you mentioned are valid to me. If you have any cycles to do such 
improvements, please convert this to an umbrella and we can help with patch 
reviews. 

My bottom line is to lower the overhead of activities recording as much as 
possible when it is not recording. Also, if you have any ideas about making 
the results easier for users to access, such as via web UI / CLI, etc., it 
would be super helpful.
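
One way to keep the overhead low while nothing is being recorded is a cheap guard plus lazily built diagnostics. The sketch below is only an assumption about how that could be done; LowOverheadRecorderSketch and its methods are hypothetical and are not the real ActivitiesLogger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical recorder; not the real ActivitiesLogger.
class LowOverheadRecorderSketch {
  // Turned on only while someone has asked for activities to be recorded.
  private volatile boolean recording;
  // Plain list for brevity; a real recorder would need thread-safe storage.
  private final List<String> records = new ArrayList<>();

  void setRecording(boolean on) {
    recording = on;
  }

  // The Supplier defers building the diagnostic text, so the disabled path
  // costs one volatile read and no string construction.
  void record(Supplier<String> diagnostic) {
    if (!recording) {
      return;
    }
    records.add(diagnostic.get());
  }

  // Returns a copy of what has been recorded so far.
  List<String> records() {
    return new ArrayList<>(records);
  }
}
```

Callers pass a lambda such as () -> "insufficient resource names:" + names, so the string is only assembled when recording is switched on.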




[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities

2018-11-28 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701600#comment-16701600
 ] 

Tao Yang commented on YARN-9050:


cc: [~cheersyang], [~leftnoteasy], [~sunil.g].  I would be interested in your 
thoughts on this issue. Thanks.
