[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9050:
---------------------------
    Description: 
We have made some usability improvements to scheduler activities, based on 
YARN 3.1, in our cluster, as follows:
1. Scheduler activities are not available for multi-thread asynchronous 
scheduling. App and node activities may get mixed up when multiple scheduling 
threads record activities of different allocation processes in the same 
variables, such as appsAllocation and recordingNodesAllocation in 
ActivitiesManager. I think these variables should be thread-local to keep 
activities separate among multiple threads.
2. Activities are incomplete for the multi-node lookup mechanism, since 
ActivitiesLogger skips recording through {{if (node == null || 
activitiesManager == null) }} when node is null, which indicates that the 
allocation is for multiple nodes. We need to support recording activities for 
the multi-node lookup mechanism.
3. Current app activities cannot meet the requirements of diagnostics. For 
example, we can see that a node doesn't match a request, but it is hard to know 
why; especially when using placement constraints, it's difficult to make a 
detailed diagnosis manually. So I propose to improve the diagnoses of 
activities: add a diagnosis for the placement constraints check, update the 
insufficient resource diagnosis with detailed info 
(like 'insufficient resource names:[memory-mb]'), and so on.
4. Add more useful fields for app activities. In some scenarios we need to 
distinguish different requests but can't locate them based on the app 
activities info; other fields, such as allocation tags, can help to filter 
what we want. We have added containerPriority, allocationRequestId and 
allocationTags fields in AppAllocation.
5. Filter app activities by key fields. Sometimes the app activities results 
are massive and it's hard to find what we want. We have added support for 
filtering by allocation-tags to meet requirements from some apps; moreover, we 
can take container-priority and allocation-request-id as candidates if 
necessary.
6. Aggregate app activities by diagnoses. Even for a single allocation process, 
activities can still be massive in a large cluster. We frequently want to know 
why a request can't be allocated in the cluster, and it's hard to check every 
node manually, so aggregating app activities by diagnoses is necessary. We 
have added a groupingType parameter to the app-activities REST API for this, 
which supports grouping by diagnostics; an example looks like this:
 !image-2018-11-23-16-46-38-138.png! 
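
As a rough illustration of point 6 (this is not the code from the patch), 
grouping per-node activities by their diagnostic text could be sketched as 
below. NodeActivity and ActivityGrouping are hypothetical stand-ins rather than 
real YARN classes, and the placement-constraint diagnostic string used in the 
usage note is made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one per-node app activity; not a real YARN class.
final class NodeActivity {
  final String nodeId;
  final String diagnostic;

  NodeActivity(String nodeId, String diagnostic) {
    this.nodeId = nodeId;
    this.diagnostic = diagnostic;
  }
}

// Hypothetical helper showing the grouping idea only.
final class ActivityGrouping {
  // Maps each distinct diagnostic to the IDs of the nodes that reported it,
  // keeping diagnostics in first-seen order.
  static Map<String, List<String>> groupByDiagnostic(
      List<NodeActivity> activities) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (NodeActivity activity : activities) {
      groups.computeIfAbsent(activity.diagnostic, k -> new ArrayList<>())
          .add(activity.nodeId);
    }
    return groups;
  }
}
```

With three activities where node1 and node3 report 
'insufficient resource names:[memory-mb]' and node2 reports a (made-up) 
placement-constraint diagnostic, the result has two groups instead of three 
per-node entries, which is what makes the output readable in a large cluster.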

I think we can discuss these points; useful improvements that are accepted 
will be added to the patch. Thanks.
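
To make point 1 above more concrete, here is a minimal sketch of the 
thread-local idea, assuming each scheduling thread records into its own list. 
ThreadLocalActivityRecorder and its methods are hypothetical names for 
illustration; this is not how ActivitiesManager is actually written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recorder; the real ActivitiesManager fields and types differ.
final class ThreadLocalActivityRecorder {
  // Each scheduling thread lazily gets its own list, so concurrent
  // allocation processes never write into a shared variable.
  private final ThreadLocal<List<String>> recording =
      ThreadLocal.withInitial(ArrayList::new);

  // Appends an activity to the calling thread's own recording.
  void record(String activity) {
    recording.get().add(activity);
  }

  // Returns the activities recorded by the calling thread and clears
  // that thread's state for the next allocation process.
  List<String> finish() {
    List<String> result = new ArrayList<>(recording.get());
    recording.remove();
    return result;
  }
}
```

Two threads calling record() and then finish() on the same recorder each get 
back only their own activities, which is the separation point 1 asks for.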

  was:
We have made some usability improvements to scheduler activities, based on 
YARN 3.1, in our cluster, as follows:
1. Scheduler activities are not available for multi-thread asynchronous 
scheduling. App and node activities may get mixed up when multiple scheduling 
threads record activities of different allocation processes in the same 
variables, such as appsAllocation and recordingNodesAllocation in 
ActivitiesManager. I think these variables should be thread-local to keep 
activities separate between multiple threads.
2. Activities are incomplete for the multi-node lookup mechanism, since 
ActivitiesLogger skips recording through {{if (node == null || 
activitiesManager == null) return; }} when node is null, which indicates that 
the allocation is for multiple nodes. We need to support recording activities 
for the multi-node lookup mechanism.
3. Current app activities cannot meet the requirements of diagnostics. For 
example, we can see that a node doesn't match a request, but it is hard to know 
why; especially when using placement constraints, it's difficult to make a 
detailed diagnosis manually. So I propose to improve the diagnoses of 
activities: add a diagnosis for the placement constraints check, update the 
insufficient resource diagnosis with detailed info 
(like 'insufficient resource names:[memory-mb]'), and so on.
4. Add more useful fields for app activities. In some scenarios we need to 
distinguish different requests but can't locate them based on the app 
activities info; other fields, such as allocation tags, can help to filter 
what we want. We have added containerPriority, allocationRequestId and 
allocationTags fields in AppAllocation.
5. Filter app activities by key fields. Sometimes the app activities results 
are massive and it's hard to find what we want. We have added support for 
filtering by allocation-tags to meet requirements from some apps; moreover, we 
can take container-priority and allocation-request-id as candidates if 
necessary.
6. Aggregate app activities by diagnoses. Even for a single allocation process, 
activities can still be massive in a large cluster. We frequently want to know 
why a request can't be allocated in the cluster, and it's hard to check every 
node manually, so aggregating app activities by diagnoses is necessary. We 
have added a groupingType parameter to the app-activities REST API for this, 
which supports grouping by diagnostics; an example looks like this:
 !image-2018-11-23-16-46-38-138.png! 

I think we can discuss these points; useful improvements that are accepted 
will be added to the patch. Thanks.


> Usability improvements for scheduler activities
> -----------------------------------------------
>
>                 Key: YARN-9050
>                 URL: https://issues.apache.org/jira/browse/YARN-9050
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: image-2018-11-23-16-46-38-138.png
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

