[ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988640#comment-14988640
 ] 

Wangda Tan commented on YARN-3946:
----------------------------------

Hi [~Naganarasimha], 
Thanks for working on this. The general idea of this approach looks good. A few 
suggestions about what to show: 
- AM launch diagnostics should have an initial value after the application is 
added to the scheduler:
For an unmanaged AM, it should be "User launched the Application Master since 
it's unmanaged".
For a managed AM, it should be "Added to scheduler, waiting to be scheduled", 
with some general suggestions about configurations to look at, such as 
user-limit, am-percent, queue-limit, etc.
- Looping over all applications when the queue exceeds its limit is too costly. 
I'd prefer to do nothing when this happens.
- After the application moves to the activated state, if it is traversed by the 
scheduler but no resource can be allocated, you could set something like 
"Trying to allocate to AM on node=x, etc.". After YARN-4091 we should be able 
to get more detailed information about why this happened.
- Not caused by your patch: isWaitingForAMContainer checks whether the master 
container has been created. You may also need to check whether the application 
is in the recovering state, because the AM could contact the RM before the RM 
has recovered the AM container.
- Similarly, you may need to set a diagnostic message while the AM is being 
recovered by the RM.
- After the AM is launched, the diagnostics could be something like "AM is 
launched", which is better than empty text.
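
A minimal sketch of the initial values suggested above. The class and method 
names (InitialDiagnosticsSketch, initialDiagnostics) are hypothetical and are 
not existing YARN code; only the message texts come from the suggestions.

```java
// Hypothetical helper, for illustration only; not part of the YARN code base.
final class InitialDiagnosticsSketch {
    // Message texts taken from the suggestions in this comment.
    static final String UNMANAGED_AM_MSG =
        "User launched the Application Master since it's unmanaged";
    static final String MANAGED_AM_MSG =
        "Added to scheduler, waiting to be scheduled";
    // Assumed wording for the configuration hint; adjust as needed.
    static final String CONFIG_HINT =
        " (check user-limit, am-percent and queue-limit settings)";

    private InitialDiagnosticsSketch() {
    }

    // Returns the diagnostics to set right after the attempt is added
    // to the scheduler, depending on whether the AM is unmanaged.
    static String initialDiagnostics(boolean unmanagedAM) {
        return unmanagedAM ? UNMANAGED_AM_MSG : MANAGED_AM_MSG + CONFIG_HINT;
    }
}
```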

Regarding the implementation:
- Since RMAppAttempt and SchedulerApplicationAttempt have a 1-to-1 
relationship, we can save a reference to the RMAppAttempt in 
SchedulerApplicationAttempt, which avoids looking it up through 
{{RMContext.getRMApps()...}}
- Since String is immutable, amLaunchDiagnostics could be volatile so we don't 
need to acquire locks.
- I suggest adding this to the REST API / web UI together with this patch if 
the changes are not complex.
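
The volatile suggestion above could look roughly like this. 
AttemptDiagnosticsSketch and its method names are hypothetical stand-ins, not 
the real SchedulerApplicationAttempt; only the field name amLaunchDiagnostics 
comes from the comment.

```java
// Hypothetical stand-in for SchedulerApplicationAttempt; illustration only.
class AttemptDiagnosticsSketch {
    // String is immutable, so swapping the reference held by a volatile
    // field lets readers see a complete, current message without a lock.
    private volatile String amLaunchDiagnostics = "";

    // Writer side (e.g. the scheduler thread): a single reference assignment.
    void updateAMLaunchDiagnostics(String message) {
        this.amLaunchDiagnostics = message;
    }

    // Reader side (e.g. REST or web UI threads): a single volatile read.
    String getAMLaunchDiagnostics() {
        return amLaunchDiagnostics;
    }
}
```

This is only safe while each update replaces the whole message; appending to 
the previous value (read, concatenate, write) would still need synchronization.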

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---------------------------------------------------------------------------
>
>                 Key: YARN-3946
>                 URL: https://issues.apache.org/jira/browse/YARN-3946
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sumit Nigam
>            Assignee: Naganarasimha G R
>         Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic 
> message.png
>
>
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/ memory requirement not being met, or AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
