[ 
https://issues.apache.org/jira/browse/FLINK-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben La Monica updated FLINK-10278:
----------------------------------
    Fix Version/s: 1.5.3

> Flink in YARN cluster uses wrong path when looking for Kerberos Keytab
> ----------------------------------------------------------------------
>
>                 Key: FLINK-10278
>                 URL: https://issues.apache.org/jira/browse/FLINK-10278
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.2
>            Reporter: Ben La Monica
>            Priority: Major
>             Fix For: 1.5.3
>
>
> While running Flink in a YARN cluster with more than one physical machine, 
> the first TaskManager starts fine, but the second TaskManager fails to start 
> because it looks for the Kerberos keytab at the location that exists on the 
> *FIRST* TaskManager. See the log lines below (unrelated lines removed for 
> clarity):
> {code:java}
> 2018-09-01 23:00:34,322 INFO class=o.a.f.yarn.YarnTaskExecutorRunner thread=main Current working/local Directory: /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005
> 2018-09-01 23:00:34,339 INFO class=o.a.f.r.c.BootstrapTools thread=main Setting directories for temporary files to: /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005
> 2018-09-01 23:00:34,339 INFO class=o.a.f.yarn.YarnTaskExecutorRunner thread=main keytab path: /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_000319/krb5.keytab
> 2018-09-01 23:00:34,339 INFO class=o.a.f.yarn.YarnTaskExecutorRunner thread=main YARN daemon is running as: hadoop Yarn client user obtainer: hadoop
> 2018-09-01 23:00:34,343 ERROR class=o.a.f.yarn.YarnTaskExecutorRunner thread=main YARN TaskManager initialization failed.
> org.apache.flink.configuration.IllegalConfigurationException: Kerberos login configuration is invalid; keytab '/mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_000001/krb5.keytab' does not exist
> at org.apache.flink.runtime.security.SecurityConfiguration.validate(SecurityConfiguration.java:139)
> at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:90)
> at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:71)
> at org.apache.flink.yarn.YarnTaskExecutorRunner.run(YarnTaskExecutorRunner.java:120)
> at org.apache.flink.yarn.YarnTaskExecutorRunner.main(YarnTaskExecutorRunner.java:73)
> {code}
>  
> Note that the "keytab path" log statement says the keytab is located in 
> container 000319:
> /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_{color:#14892c}*000319*{color}/krb5.keytab
> However, after I changed the code to report the file that is actually 
> checked during SecurityConfiguration initialization, it turned out to be 
> checking container 000001, which is not on this host:
> /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_{color:#d04437}*000001*{color}/krb5.keytab
> This causes the YARN TaskManagers to restart over and over again (which is 
> why we're up to container 319).
> I'll submit a PR for this fix; it basically just moves the initialization of 
> the SecurityConfiguration down two lines.
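
The ordering problem described above can be illustrated with a small, self-contained sketch. This is not the Flink source or the actual patch. It assumes that the configuration initially carries a keytab path that is valid only on another host, and that the runner rewrites it to the local container's keytab; the class `SecurityCheck` and the method `tryStart` are hypothetical stand-ins. Only the option name `security.kerberos.login.keytab` is taken from Flink.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

class KeytabOrderingSketch {
    // Real Flink option name; everything else in this sketch is hypothetical.
    static final String KEYTAB_KEY = "security.kerberos.login.keytab";

    // Hypothetical stand-in for a security configuration that validates
    // the keytab path at construction time.
    static final class SecurityCheck {
        SecurityCheck(Map<String, String> conf) {
            String keytab = conf.get(KEYTAB_KEY);
            if (keytab != null && !Files.exists(Paths.get(keytab))) {
                throw new IllegalStateException("keytab '" + keytab + "' does not exist");
            }
        }
    }

    // Simulates container start-up. Returns "started" or "failed: <reason>".
    static String tryStart(boolean fixedOrder) {
        try {
            // The local container directory, which does contain a keytab.
            Path containerDir = Files.createTempDirectory("container_local");
            Path localKeytab = Files.createFile(containerDir.resolve("krb5.keytab"));

            // Assumption: the shipped configuration still points at a path
            // that only exists on a different host.
            Map<String, String> conf = new HashMap<>();
            conf.put(KEYTAB_KEY,
                    containerDir.resolve("container_elsewhere").resolve("krb5.keytab").toString());

            if (fixedOrder) {
                // Rewrite the path first, then validate.
                conf.put(KEYTAB_KEY, localKeytab.toString());
                new SecurityCheck(conf);
            } else {
                // Validate first: the stale path is checked and rejected
                // before the rewrite below is ever reached.
                new SecurityCheck(conf);
                conf.put(KEYTAB_KEY, localKeytab.toString());
            }
            return "started";
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("validate before rewrite: " + tryStart(false));
        System.out.println("rewrite before validate: " + tryStart(true));
    }
}
```

In the sketch, validating before the rewrite fails with a "does not exist" message for the stale path, while rewriting first lets start-up proceed. This mirrors the symptom in the log, where the logged keytab path and the validated path differ.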



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
