[ 
https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182810#comment-15182810
 ] 

Steve Loughran commented on YARN-4721:
--------------------------------------

This patch initially sets up Kerberos diagnostics without side effects.

After this, I'd also like to do an {{ls /}} of the filesystem. Maybe just 
make this another option to run if yarn.resourcemanager.kdiag.enabled=true. 
Against a kerberized FS this would trigger negotiation immediately and, if 
there are problems, report them. (This would have to be done asynchronously, 
obviously.)

The problem with the current talk-during-renew process is that any problem 
stays hidden until someone submits work. It then surfaces as "job submit 
failed", rather than the more fundamental "your RM doesn't have the 
credentials to talk to HDFS".

This would not be a hard-coded binding to HDFS; it is purely a check that the 
cluster FS is readable by the RM principal.
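
To make the idea concrete, here is a minimal sketch of the asynchronous startup probe. It is not code from the patch: the class {{StartupFsProbe}}, its {{Result}} type and the {{listRoot}} callback are hypothetical names. The callback stands in for the real {{ls /}}, so the sketch is not tied to HDFS and runs with only the JDK (Java 9 or later, for {{completeOnTimeout}}).

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names come from the patch. */
public class StartupFsProbe {

    /** Outcome of one probe: readable, or a message naming the underlying cause. */
    public static final class Result {
        public final boolean readable;
        public final String message;

        Result(boolean readable, String message) {
            this.readable = readable;
            this.message = message;
        }
    }

    /**
     * Runs the listing on a background thread so that startup is not blocked.
     * listRoot stands in for "ls /" and returns the number of entries found.
     * If no answer arrives within timeoutMillis, the probe reports a failure.
     */
    public static CompletableFuture<Result> probeAsync(Callable<Integer> listRoot,
                                                       long timeoutMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                int entries = listRoot.call();
                return new Result(true,
                    "cluster FS readable by the RM principal: " + entries + " entries under /");
            } catch (Exception e) {
                // Report the fundamental problem now, not "job submit failed" later.
                return new Result(false, "RM cannot read the cluster FS: " + e);
            }
        }).completeOnTimeout(
            new Result(false, "RM cannot read the cluster FS: no answer after "
                + timeoutMillis + " ms"),
            timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // Simulated authentication failure; a real probe would list the cluster FS root.
        Callable<Integer> failing = () -> {
            throw new IOException("simulated authentication failure");
        };
        Result r = probeAsync(failing, 5_000).join();
        System.out.println(r.readable + ": " + r.message);
    }
}
```

In the RM, the callback would presumably wrap something like {{FileSystem.get(conf).listStatus(new Path("/")).length}}, and the probe would only be started when yarn.resourcemanager.kdiag.enabled=true; both of those details are assumptions, not part of the sketch.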

> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4721
>                 URL: https://issues.apache.org/jira/browse/YARN-4721
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-12889-001.patch
>
>
> If the RM can't auth with HDFS, this can first surface during job submission, 
> which can cause confusion about what's wrong and whose credentials are 
> playing up.
> Instead, the RM could try to talk to HDFS on launch, {{ls /}} should suffice. 
> If it can't auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this 
> point. Certainly it can't currently accept work. But should it fail fast or 
> keep going in the hope that the problem is in the KDC or NN and will fix 
> itself without an RM restart?



