[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738219#comment-16738219 ]
Tao Yang commented on YARN-9050:
--------------------------------

Hi [~leftnoteasy], [~cheersyang]

The design doc is attached [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5]; please help review it when you have time. Thanks!

> Usability improvements for scheduler activities
> -----------------------------------------------
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements for scheduler activities, based on
> YARN 3.1, in our cluster as follows:
> 1. Not available for multi-thread asynchronous scheduling. App and node
> activities may be confused when multiple scheduling threads record activities
> of different allocation processes in the same variables, such as
> appsAllocation and recordingNodesAllocation in ActivitiesManager. I think
> these variables should be thread-local to keep activities separate across
> threads.
> 2. Incomplete activities for the multi-node lookup mechanism:
> ActivitiesLogger skips recording through {{if (node == null ||
> activitiesManager == null) }} when node is null, which indicates that the
> allocation is for multiple nodes. We need to support recording activities for
> the multi-node lookup mechanism.
> 3. Current app activities cannot meet the requirements of diagnostics. For
> example, we can see that a node doesn't match a request, but it is hard to
> know why; especially when using placement constraints, it's difficult to make
> a detailed diagnosis manually. So I propose to improve the diagnostics of
> activities: add diagnostics for the placement constraints check, update the
> insufficient-resource diagnostic with detailed info (like 'insufficient
> resource names:[memory-mb]'), and so on.
> 4. Add more useful fields for app activities. In some scenarios we need to
> distinguish different requests but can't locate them based on app activities
> info; some other fields, such as allocation tags, can help to filter what we
> want. We have added containerPriority, allocationRequestId and allocationTags
> fields in AppAllocation.
> 5. Filter app activities by key fields. Sometimes the results of app
> activities are massive and it's hard to find what we want. We support
> filtering by allocation-tags to meet requirements from some apps; moreover,
> we can take container-priority and allocation-request-id as candidates if
> necessary.
> 6. Aggregate app activities by diagnostics. For a single allocation process,
> activities can still be massive in a large cluster. We frequently want to
> know why a request can't be allocated in the cluster, and it's hard to check
> every node manually in a large cluster, so aggregating app activities by
> diagnostics is necessary. We have added a groupingType parameter to the
> app-activities REST API for this; it supports grouping by diagnostics, for
> example:
> !image-2018-11-23-16-46-38-138.png!
> I think we can discuss these points; useful improvements that are accepted
> will be added to the patch. Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
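
Points 1 and 6 of the issue description (thread-local recording and grouping by diagnostics) can be illustrated with a minimal Java sketch. This is not the YARN-9050 patch or the real ActivitiesManager code: the class ActivitiesSketch, its methods, and the NodeActivity type are hypothetical names invented for illustration, and the "placement constraint not satisfied" message is an assumed diagnostic string. Only 'insufficient resource names:[memory-mb]' comes from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these types are the real YARN classes.
final class ActivitiesSketch {

  /** One recorded node-level outcome for a request (assumed shape). */
  static final class NodeActivity {
    final String nodeId;
    final String diagnostic;

    NodeActivity(String nodeId, String diagnostic) {
      this.nodeId = nodeId;
      this.diagnostic = diagnostic;
    }
  }

  // Each scheduling thread gets its own buffer, so records from concurrent
  // allocation processes cannot be mixed together (point 1).
  private static final ThreadLocal<List<NodeActivity>> CURRENT =
      ThreadLocal.withInitial(ArrayList::new);

  /** Appends a record to the calling thread's buffer only. */
  static void record(String nodeId, String diagnostic) {
    CURRENT.get().add(new NodeActivity(nodeId, diagnostic));
  }

  /** Returns a copy of the calling thread's records and clears its buffer. */
  static List<NodeActivity> finishAllocation() {
    List<NodeActivity> done = new ArrayList<>(CURRENT.get());
    CURRENT.remove();
    return done;
  }

  /** Groups node ids by diagnostic message (point 6). */
  static Map<String, List<String>> groupByDiagnostic(
      List<NodeActivity> activities) {
    Map<String, List<String>> groups = new TreeMap<>();
    for (NodeActivity a : activities) {
      groups.computeIfAbsent(a.diagnostic, k -> new ArrayList<>())
          .add(a.nodeId);
    }
    return groups;
  }

  public static void main(String[] args) throws InterruptedException {
    // A second scheduling thread records into its own buffer.
    Thread other = new Thread(() -> record("node-9", "recorded elsewhere"));
    other.start();
    other.join();

    record("node-1", "insufficient resource names:[memory-mb]");
    record("node-2", "insufficient resource names:[memory-mb]");
    record("node-3", "placement constraint not satisfied"); // assumed text

    Map<String, List<String>> groups = groupByDiagnostic(finishAllocation());
    groups.forEach((diagnostic, nodes) ->
        System.out.println(nodes.size() + " node(s): " + diagnostic));
  }
}
```

In this sketch the record made by the second thread never appears in the main thread's result, which is the separation point 1 asks for, and the grouped view replaces a per-node listing with one entry per distinct diagnostic.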