Ethan Li created FLINK-9684:
-------------------------------

             Summary: HistoryServerArchiveFetcher not working properly with secure hdfs cluster
                 Key: FLINK-9684
                 URL: https://issues.apache.org/jira/browse/FLINK-9684
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.4.2
            Reporter: Ethan Li


With my current setup, the JobManager and TaskManager are able to talk to the HDFS cluster (which has Kerberos enabled). However, running the history server fails with:

{code:java}
2018-06-27 19:03:32,080 WARN org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name
2018-06-27 19:03:32,085 ERROR org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher - Failed to access job archive location for path hdfs://openqe11blue-n2.blue.ygrid.yahoo.com/tmp/flink/openstorm10-blue/jmarchive.
java.io.IOException: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name; Host Details : local host is: "openstorm10blue-n2.blue.ygrid.yahoo.com/10.215.79.35"; destination host is: "openqe11blue-n2.blue.ygrid.yahoo.com":8020;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    at org.apache.hadoop.ipc.Client.call(Client.java:1414)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy9.getListing(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy9.getListing(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:515)
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1726)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:650)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.listStatus(HadoopFileSystem.java:146)
    at org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher$JobArchiveFetcherTask.run(HistoryServerArchiveFetcher.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
    at org.apache.hadoop.ipc.Client.call(Client.java:1381)
    ... 28 more
{code}

After changing the log level to DEBUG, I see the following:

{code:java}
2018-06-27 19:03:30,931 INFO org.apache.flink.runtime.webmonitor.history.HistoryServer - Enabling SSL for the history server.
2018-06-27 19:03:30,931 DEBUG org.apache.flink.runtime.net.SSLUtils - Creating server SSL context from configuration
2018-06-27 19:03:31,091 DEBUG org.apache.flink.core.fs.FileSystem - Loading extension file systems via services
2018-06-27 19:03:31,094 DEBUG org.apache.flink.core.fs.FileSystem - Added file system maprfs:org.apache.flink.runtime.fs.maprfs.MapRFsFactory
2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils - Cannot find hdfs-default configuration-file path in Flink config.
2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils - Cannot find hdfs-site configuration-file path in Flink config.
2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils - Could not find Hadoop configuration via any of the supported methods (Flink configuration, environment variables).
2018-06-27 19:03:31,178 DEBUG org.apache.flink.runtime.fs.hdfs.HadoopFsFactory - Instantiating for file system scheme hdfs Hadoop File System org.apache.hadoop.hdfs.DistributedFileSystem
2018-06-27 19:03:31,829 INFO org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher - Monitoring directory hdfs://openqe11blue-n2.blue.ygrid.yahoo.com/tmp/flink/openstorm10-blue/jmarchive for archived jobs.
{code}

The root cause is this line in HistoryServer.java:

https://github.com/apache/flink/blob/release-1.4.2/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServer.java#L169

{code:java}
FileSystem refreshFS = refreshPath.getFileSystem();
{code}

getFileSystem() is called before {{FileSystem.initialize(...)}} has ever been invoked, so it hits the fallback at

[https://github.com/apache/flink/blob/release-1.4.2/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L388-L390]

{code:java}
if (FS_FACTORIES.isEmpty()) {
    initialize(new Configuration());
}
{code}
and because that Configuration is empty, the history server cannot connect to the secure HDFS cluster correctly.
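
For reference, here is a minimal standalone sketch of what effectively happens (the class name and the HDFS path are illustrative, not taken from the actual HistoryServer code):
{code:java}
import java.io.IOException;

import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

/**
 * Reproduction sketch: without a prior FileSystem.initialize(flinkConfig) call,
 * the first Path#getFileSystem() triggers the fallback initialize(new Configuration()),
 * so fs.hdfs.hadoopconf and the Kerberos-enabled Hadoop settings are never applied.
 */
public class HistoryServerFsRepro {

    public static void main(String[] args) throws IOException {
        // Illustrative path; point it at a directory on a Kerberos-secured HDFS cluster.
        Path archivePath = new Path("hdfs://namenode.example.com/tmp/flink/jm-archive");

        // No FileSystem.initialize(...) with the Flink configuration has run at this point,
        // mirroring HistoryServer.java#L169 in release-1.4.2.
        FileSystem fs = archivePath.getFileSystem();

        // Against a secure cluster this fails with
        // "Failed to specify server's Kerberos principal name".
        fs.listStatus(archivePath);
    }
}
{code}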

 

A workaround is to set the HADOOP_CONF_DIR or HADOOP_HOME environment variable before starting the history server.

But we should fix this, since we have the {{fs.hdfs.hadoopconf}} config option; otherwise it will be confusing to users.
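
One way to fix it, sketched roughly below (the class name is illustrative and the actual change would live in the HistoryServer startup path; the exact wiring may differ):
{code:java}
import java.io.IOException;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

/**
 * Fix sketch: initialize the FileSystem abstraction with the loaded Flink
 * configuration before the first Path#getFileSystem() call, so that
 * fs.hdfs.hadoopconf (and the secure Hadoop configuration it points to) is honored.
 */
public class HistoryServerFsInitSketch {

    public static void main(String[] args) throws IOException {
        // Load flink-conf.yaml, which contains fs.hdfs.hadoopconf and the security settings.
        Configuration flinkConfig = GlobalConfiguration.loadConfiguration();

        // Must run before any Path#getFileSystem() call; otherwise FileSystem
        // falls back to initialize(new Configuration()) with an empty config.
        FileSystem.initialize(flinkConfig);

        // Illustrative path; in the history server this comes from the configured refresh directories.
        Path refreshPath = new Path("hdfs://namenode.example.com/tmp/flink/jm-archive");
        FileSystem refreshFS = refreshPath.getFileSystem(); // now picks up the configured Hadoop settings

        System.out.println("Resolved file system: " + refreshFS.getClass().getName());
    }
}
{code}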

 

 


