[ 
https://issues.apache.org/jira/browse/FLINK-25099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451193#comment-17451193
 ] 

chenqizhu edited comment on FLINK-25099 at 11/30/21, 3:41 PM:
--------------------------------------------------------------

When I ignore setting the default.fs=flinkcluster and specifying the checkpoint 
path to hdfs://flinkcluster/xxxxx,
The job could not run properly, the status was always INITIALIZING(It seems 
that the jobmanger cannot be started, but I'm not sure why)
Changing the checkpoint path to hdfs:///xxxxx and everything works fine(It 
obviously uses the default HDFS)[~zuston]

The following is jobmanager.log

{code:java}
2021-11-30 23:32:44,345 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService  
           [] - Starting RPC endpoint for 
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at 
akka://flink/user/rpc/resourcemanager_0 .
2021-11-30 23:32:44,406 INFO  
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Starting DefaultLeaderElectionService with 
ZooKeeperLeaderElectionDriver{leaderPath='/leader/dispatcher_lock'}.
2021-11-30 23:32:44,407 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Starting the resource manager.
2021-11-30 23:32:44,408 INFO  
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
Starting DefaultLeaderRetrievalService with 
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
2021-11-30 23:32:44,409 INFO  
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
Starting DefaultLeaderRetrievalService with 
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/dispatcher_lock'}.
2021-11-30 23:32:44,409 INFO  
org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - 
DefaultDispatcherRunner was granted leadership with leader id 
8d5e0cf1-06da-4648-bd89-f2949356902f. Creating new DispatcherLeaderProcess.
2021-11-30 23:32:44,414 INFO  
org.apache.flink.runtime.dispatcher.runner.JobDispatcherLeaderProcess [] - 
Start JobDispatcherLeaderProcess.
2021-11-30 23:32:44,421 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService  
           [] - Starting RPC endpoint for 
org.apache.flink.runtime.dispatcher.MiniDispatcher at 
akka://flink/user/rpc/dispatcher_1 .
2021-11-30 23:32:44,449 INFO  
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Starting DefaultLeaderElectionService with 
ZooKeeperLeaderElectionDriver{leaderPath='/leader/cad84e92fb8ac17daf839af61fb8f9ae/job_manager_lock'}.
2021-11-30 23:32:44,498 WARN  akka.remote.transport.netty.NettyTransport        
           [] - Remote connection to [null] failed with 
java.net.ConnectException: Connection refused: flink7/10.21.0.7:26635
2021-11-30 23:32:44,499 WARN  akka.remote.ReliableDeliverySupervisor            
           [] - Association with remote system [akka.tcp://flink@flink7:26635] 
has failed, address is now gated for [50] ms. Reason: [Association failed with 
[akka.tcp://flink@flink7:26635]] Caused by: [java.net.ConnectException: 
Connection refused: flink7/10.21.0.7:26635]
2021-11-30 23:32:44,504 INFO  
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] - Failing 
over to rm2
2021-11-30 23:32:44,547 INFO  org.apache.flink.yarn.YarnResourceManagerDriver   
           [] - Recovered 0 containers from previous attempts ([]).
2021-11-30 23:32:44,547 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Recovered 0 workers from previous attempt.
2021-11-30 23:32:44,584 WARN  akka.remote.transport.netty.NettyTransport        
           [] - Remote connection to [null] failed with 
java.net.ConnectException: Connection refused: flink7/10.21.0.7:26635
2021-11-30 23:32:44,585 WARN  akka.remote.ReliableDeliverySupervisor            
           [] - Association with remote system [akka.tcp://flink@flink7:26635] 
has failed, address is now gated for [50] ms. Reason: [Association failed with 
[akka.tcp://flink@flink7:26635]] Caused by: [java.net.ConnectException: 
Connection refused: flink7/10.21.0.7:26635]
2021-11-30 23:32:44,601 INFO  org.apache.hadoop.conf.Configuration              
           [] - resource-types.xml not found
2021-11-30 23:32:44,602 INFO  
org.apache.hadoop.yarn.util.resource.ResourceUtils           [] - Unable to 
find 'resource-types.xml'.
2021-11-30 23:32:44,615 INFO  
org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled 
external resources: []
2021-11-30 23:32:44,620 INFO  
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Upper bound 
of the thread pool size is 500
2021-11-30 23:32:44,623 INFO  
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Starting DefaultLeaderElectionService with 
ZooKeeperLeaderElectionDriver{leaderPath='/leader/resource_manager_lock'}.
2021-11-30 23:32:44,626 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
ResourceManager akka.tcp://flink@flink26:30734/user/rpc/resourcemanager_0 was 
granted leadership with fencing token bceb6779160f5ffb00c5b73c156844a0
End of LogType:jobmanager.log.This log file belongs to a running container 
(container_e19_1637144069883_16539_09_000001) and so may not be complete.
*******************************************************************************
{code}









was (Author: libra_816):
When I ignore setting the default.fs=flinkcluster and specifying the checkpoint 
path to hdfs://flinkcluster/xxxxx,
The job could not run properly, the status was always INITIALIZING(It seems 
that the jobmanger cannot be started, but I'm not sure why)
Changing the checkpoint path to hdfs:///xxxxx and everything works fine(It 
obviously uses the default HDFS)[~zuston]




> flink on yarn Accessing two HDFS Clusters
> -----------------------------------------
>
>                 Key: FLINK-25099
>                 URL: https://issues.apache.org/jira/browse/FLINK-25099
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, FileSystems, Runtime / State Backends
>    Affects Versions: 1.13.3
>         Environment: flink : 1.13.3
> hadoop : 3.3.0
>            Reporter: chenqizhu
>            Priority: Major
>         Attachments: flink-chenqizhu-client-hdfsn21n163.log
>
>
> Flink version 1.13 supports configuration of Hadoop properties in 
> flink-conf.yaml via flink.hadoop.*. There is A requirement to write 
> checkpoint to HDFS with SSDS (called cluster B) to speed checkpoint writing, 
> but this HDFS cluster is not the default HDFS in the flink client (called 
> cluster A by default). Yaml is configured with nameservices for cluster A and 
> cluster B, which is similar to HDFS federated mode.
> The configuration is as follows:
>  
> {code:java}
> flink.hadoop.dfs.nameservices: ACluster,BCluster
> flink.hadoop.fs.defaultFS: hdfs://BCluster
> flink.hadoop.dfs.ha.namenodes.ACluster: nn1,nn2
> flink.hadoop.dfs.namenode.rpc-address.ACluster.nn1: 10.xxxx:9000
> flink.hadoop.dfs.namenode.http-address.ACluster.nn1: 10.xxxx:50070
> flink.hadoop.dfs.namenode.rpc-address.ACluster.nn2: 10.xxxxxx:9000
> flink.hadoop.dfs.namenode.http-address.ACluster.nn2: 10.xxxxxx:50070
> flink.hadoop.dfs.client.failover.proxy.provider.ACluster: 
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
> flink.hadoop.dfs.ha.namenodes.BCluster: nn1,nn2
> flink.hadoop.dfs.namenode.rpc-address.BCluster.nn1: 10.xxxxxx:9000
> flink.hadoop.dfs.namenode.http-address.BCluster.nn1: 10.xxxxxx:50070
> flink.hadoop.dfs.namenode.rpc-address.BCluster.nn2: 10.xxxxxx:9000
> flink.hadoop.dfs.namenode.http-address.BCluster.nn2: 10.xxxxx:50070
> flink.hadoop.dfs.client.failover.proxy.provider.BCluster: 
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
> {code}
>  
> However, an error occurred during the startup of the job, which is reported 
> as follows:
> (change configuration items to A flink local client default HDFS cluster, the 
> operation can be normal boot:  flink.hadoop.fs.DefaultFS: hdfs: / / ACluster)
> {noformat}
> Caused by: BCluster
> java.net.UnknownHostException: BCluster
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:448)
> at 
> org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:139)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:374)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:308)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:184)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3414)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:158)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3474)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3442)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
> at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:270)
> at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:68)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:415)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:412)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:412)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:247)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:240)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:228)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){noformat}
> Is there a solution to the above problems? The pain point is that Flink can 
> access two HDFS clusters, preferably through the configuration of Flink-conf. 
> yaml.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to