Hello,
<https://stackoverflow.com/posts/76378281/timeline>

I can launch a Flink cluster (version 1.17.x) on my laptop with 1 Job
Manager and 3 Task Managers. The cluster starts, and jobs can be submitted
correctly on localhost (my laptop).

Next, I tried to launch this cluster on 4 VMs: 1 Master VM (for the Job
Manager) and 3 Worker VMs (for the Task Managers). I am not using YARN, K8s or
Docker on any of these VMs.

The cluster starts up fine using "${FLINK_HOME}"/bin/start-cluster.sh.
Running this on the command line on the Master VM does what is expected:
it starts the Job Manager on the Master and then starts 1 Task Manager on
each Worker VM (ssh connectivity between the Master and Worker VMs is fine).

However, job submission ("${FLINK_HOME}"/bin/flink run myapp.jar) fails with
the following exceptions:

> Caused by: org.apache.flink.util.FlinkException: Could not upload job files.
>     at org.apache.flink.runtime.client.ClientUtils.uploadJobGraphFiles(ClientUtils.java:86)
>     at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$uploadJobGraphFiles$4(JobSubmitHandler.java:195)
>     ... 10 more
> Caused by: java.io.IOException: Could not connect to BlobServer at address localhost/127.0.0.1:37452
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:103)
>     at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$null$3(JobSubmitHandler.java:199)
>     at org.apache.flink.runtime.client.ClientUtils.uploadJobGraphFiles(ClientUtils.java:82)
>     ... 11 more
> Caused by: java.net.ConnectException: Connection refused (Connection refused)
>     at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
>     at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
>     at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
>     at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>     at java.base/java.net.Socket.connect(Socket.java:615)
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:97)
>     ... 13 more
> ]
>     at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:536)
>     at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:516)
>     at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>     ... 4 more

Looking at the DEBUG-level logs on the Master (Job Manager) and the Workers
(Task Managers), it looks like the Task Managers on the Worker VMs are
all registered with the Job Manager on the Master VM (the Job Manager logs
show the Task Managers). Hence, connectivity and registration from the Worker
VMs back to the Master VM (and the Job Manager) is fine.

The Web UI reflects this too. I can see the Job Manager and 3 Task Managers
running on those 4 VMs.

Hence, it is ONLY the job submission that's failing, with this "Could not
connect to BlobServer" exception. I suspect my configuration is
incorrect.
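
To narrow this down, a small TCP probe can show whether the BlobServer port
is reachable on the loopback address or only on the Master VM's external
address. This is only a sketch: the function name can_connect is made up for
illustration, and the port has to be taken from the current exception trace,
since it appears to change between cluster restarts.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
```

Run on the Master VM, can_connect("127.0.0.1", 37452) compared with
can_connect("<master-hostname>", 37452) (the hostname is a placeholder) would
indicate which interface the port is actually listening on.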

Questions:

   1. Is there a reference implementation of a Multi-VM Flink Cluster (NOT
   on Docker)?
   2. How should taskmanager.*.*.* properties in flink-conf.yaml be
   configured for a multi-VM cluster? In particular, what should the values of
   "taskmanager.bind-host" and "taskmanager.host" be on the Master VM?
   3. Does a multi-VM Flink cluster need a shared storage directory? If so,
   is there any documentation on configuring this?
   4. From reading various Flink enhancement requests and mailing list
   posts, I found that the BlobServer needs to be accessible from the Task
   Managers on an address that's external facing. Note from the exception
   trace above that the attempt to connect to the BlobServer is being made on
   "localhost/127.0.0.1:37452". I cannot find any configuration item that
   allows me to set the host or bind-host for the BlobServer. How is this to
   be done?
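
For anyone checking a similar setup, the sketch below lists every setting in
a flink-conf.yaml text whose value is a loopback address, which is what I
would look at first given the "localhost/127.0.0.1" in the trace. It assumes
the flat "key: value" layout of the 1.17 config file; the function name
find_loopback_settings is made up for illustration, and whether any of the
reported keys actually controls the BlobServer address is exactly what I am
unsure about.

```python
from typing import Dict

# Values treated as "loopback" for the purpose of this check.
LOOPBACK_VALUES = {"localhost", "127.0.0.1", "::1"}


def find_loopback_settings(conf_text: str) -> Dict[str, str]:
    """Return {key: value} for 'key: value' lines whose value is a loopback address."""
    hits: Dict[str, str] = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank line, comment-only line, or not a key/value pair
        key, _, value = line.partition(":")  # split on the first colon only
        value = value.strip().strip("'\"")
        if value.lower() in LOOPBACK_VALUES:
            hits[key.strip()] = value
    return hits
```

It can be fed the contents of conf/flink-conf.yaml under FLINK_HOME on each
VM, for example via pathlib.Path(...).read_text().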


Please advise. Thanks in advance for your responses.
