Hello,
I am trying to follow the “quickstart” guide on a GKE Autopilot k8s cluster.
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/
I could install the operator (without the webhook) without any issue. However,
after running
kubectl create -f https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.7/examples/basic.yaml
the job does not run because the TaskManager cannot reach the JobManager
(maybe a DNS issue?). Is there any special DNS or network configuration
required on GKE? Has anybody already made this work?
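To tell a DNS failure apart from a blocked connection, a check along the following lines could be run from a pod in the same namespace. It is only a minimal sketch in Python, not something from the quickstart: the function name probe and its return labels are made up for illustration, and it assumes a Python interpreter is available in the pod, which the stock Flink image may not provide.

```python
import socket


def probe(host, port, timeout=1.0):
    """Classify a TCP connection attempt to host:port.

    Returns "dns_failure" if the name does not resolve; "timeout",
    "refused" or "unreachable" if it resolves but the connection fails;
    "ok" if the connection is established.
    """
    try:
        # Resolve first so that a DNS failure is reported separately.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Only the first resolved address is tried in this sketch.
    family, socktype, proto, _, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
        except socket.timeout:
            return "timeout"
        except ConnectionRefusedError:
            return "refused"
        except OSError:
            return "unreachable"
    return "ok"
```

Calling probe("basic-example.default", 6123) with the host and port that appear in the TaskManager log would return "dns_failure" if the service name does not resolve, and "timeout" if it resolves but the connection attempt gets no answer.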
Thanks,
Arnaud
The JobManager log is:
2024-01-12 11:01:56,878 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source:
Custom Source (1/2)
(c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_0_2)
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source:
Custom Source (2/2)
(c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_1_2)
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map ->
Sink: Print to Std. Out (1/2)
(c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_0_2)
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map ->
Sink: Print to Std. Out (2/2)
(c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_1_2)
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,879 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job 096668d0039ed54215ae334b5d89aa82:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=1}]
2024-01-12 11:01:56,880 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager []
- Received resource requirements from job 096668d0039ed54215ae334b5d89aa82:
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
numberOfRequiredSlots=2}]
2024-01-12 11:01:56,902 INFO
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to
trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since Checkpoint
triggering task Source: Custom Source (1/2) of job
096668d0039ed54215ae334b5d89aa82 is not being executed at the moment. Aborting
checkpoint. Failure reason: Not all required tasks are currently running..
2024-01-12 11:01:57,014 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need
request 1 new workers, current worker number 0, declared worker number 1
2024-01-12 11:01:57,015 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0,
taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes,
networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939
bytes), numSlots=2}, current pending count: 1.
2024-01-12 11:01:57,016 INFO
org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled
external resources: []
2024-01-12 11:01:57,018 INFO org.apache.flink.configuration.Configuration
[] - Config uses fallback configuration key
'kubernetes.service-account' instead of key
'kubernetes.taskmanager.service-account'
2024-01-12 11:01:57,022 INFO
org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating new
TaskManager pod with name basic-example-taskmanager-1-3 and resource <2048,1.0>.
2024-01-12 11:01:57,095 INFO
org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Pod
basic-example-taskmanager-1-3 is created.
2024-01-12 11:01:57,116 INFO
org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Received new
TaskManager pod: basic-example-taskmanager-1-3
2024-01-12 11:01:57,117 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Requested worker basic-example-taskmanager-1-3 with resource spec
WorkerResourceSpec {cpuCores=1.0, taskHeapSize=537.600mb (563714445 bytes),
taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes),
managedMemSize=634.880mb (665719939 bytes), numSlots=2}.
2024-01-12 11:01:58,902 INFO
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to
trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since Checkpoint
triggering task Source: Custom Source (1/2) of job
096668d0039ed54215ae334b5d89aa82 is not being executed at the moment. Aborting
checkpoint. Failure reason: Not all required tasks are currently running..
(…)
The TaskManager log is:
(…)
2024-01-12 11:02:02,229 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-jmx
2024-01-12 11:02:02,232 INFO
org.apache.flink.runtime.state.changelog.StateChangelogStorageLoader [] -
StateChangelogStorageLoader initialized with shortcut names {memory,filesystem}.
2024-01-12 11:02:02,252 INFO
org.apache.flink.runtime.security.modules.HadoopModuleFactory [] - Cannot
create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2024-01-12 11:02:02,325 INFO
org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file
will be created as /tmp/jaas-3174943888264039421.conf.
2024-01-12 11:02:02,334 INFO
org.apache.flink.runtime.security.contexts.HadoopSecurityContextFactory [] -
Cannot install HadoopSecurityContext because Hadoop cannot be found in the
Classpath.
2024-01-12 11:02:02,929 INFO org.apache.flink.configuration.Configuration
[] - Config uses fallback configuration key 'jobmanager.rpc.address'
instead of key 'rest.address'
2024-01-12 11:02:02,939 INFO
org.apache.flink.runtime.util.LeaderRetrievalUtils [] - Trying to
select the network interface and address to use by connecting to the leading
JobManager.
2024-01-12 11:02:02,940 INFO
org.apache.flink.runtime.util.LeaderRetrievalUtils [] - TaskManager
will try to connect for PT10S before falling back to heuristics
2024-01-12 11:02:05,826 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Trying to connect to address
basic-example.default/100.64.3.37:6123
2024-01-12 11:02:06,027 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [basic-example-taskmanager-1-3/100.64.3.40] with timeout
[200] due to: connect timed out
2024-01-12 11:02:06,079 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:06,131 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:06,182 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/127.0.0.1] with timeout [50] due to: connect timed out
2024-01-12 11:02:07,185 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [1000] due to: connect timed out
2024-01-12 11:02:08,187 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/127.0.0.1] with timeout [1000] due to: connect timed out
2024-01-12 11:02:08,287 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Trying to connect to address
basic-example.default/100.64.3.37:6123
2024-01-12 11:02:08,489 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [basic-example-taskmanager-1-3/100.64.3.40] with timeout
[200] due to: connect timed out
2024-01-12 11:02:08,541 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:08,592 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:08,643 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/127.0.0.1] with timeout [50] due to: connect timed out
2024-01-12 11:02:09,645 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [1000] due to: connect timed out
2024-01-12 11:02:10,648 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/127.0.0.1] with timeout [1000] due to: connect timed out
2024-01-12 11:02:10,849 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Trying to connect to address
basic-example.default/100.64.3.37:6123
2024-01-12 11:02:11,051 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [basic-example-taskmanager-1-3/100.64.3.40] with timeout
[200] due to: connect timed out
2024-01-12 11:02:11,103 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:11,155 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [50] due to: connect timed out
2024-01-12 11:02:11,205 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/127.0.0.1] with timeout [50] due to: connect timed out
2024-01-12 11:02:12,208 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/100.64.3.40] with timeout [1000] due to: connect timed out
2024-01-12 11:02:13,210 INFO org.apache.flink.runtime.net.ConnectionUtils
[] - Failed to connect to [basic-example.default/100.64.3.37:6123]
from local address [/127.0.0.1] with timeout [1000] due to: connect timed out
2024-01-12 11:02:13,211 WARN org.apache.flink.runtime.net.ConnectionUtils
[] - Could not connect to basic-example.default/100.64.3.37:6123.
Selecting a local address using heuristics.
2024-01-12 11:02:13,212 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - TaskManager
will use hostname/address 'basic-example-taskmanager-1-3' (100.64.3.40) for
communication.
2024-01-12 11:02:13,331 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to
start actor system, external address 100.64.3.40:6122, bind address
0.0.0.0:6122.
2024-01-12 11:02:14,832 INFO akka.event.slf4j.Slf4jLogger
[] - Slf4jLogger started
2024-01-12 11:02:14,927 INFO akka.remote.RemoteActorRefProvider
[] - Akka Cluster not in use - enabling unsafe features anyway
because `akka.remote.use-unsafe-remote-features-outside-cluster` has been
enabled.
2024-01-12 11:02:14,928 INFO akka.remote.Remoting
[] - Starting remoting
2024-01-12 11:02:15,252 INFO akka.remote.Remoting
[] - Remoting started; listening on addresses
:[akka.tcp://flink@100.64.3.40:6122]
2024-01-12 11:02:15,642 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system
started at akka.tcp://flink@100.64.3.40:6122
2024-01-12 11:02:15,738 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Using working
directory: WorkingDirectory(/tmp/tm_basic-example-taskmanager-1-3)
2024-01-12 11:02:15,826 INFO
org.apache.flink.runtime.metrics.MetricRegistryImpl [] - No metrics
reporter configured, no metrics will be exposed/reported.
2024-01-12 11:02:15,832 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Trying to
start actor system, external address 100.64.3.40:0, bind address 0.0.0.0:0.
2024-01-12 11:02:15,928 INFO akka.event.slf4j.Slf4jLogger
[] - Slf4jLogger started
2024-01-12 11:02:15,937 INFO akka.remote.RemoteActorRefProvider
[] - Akka Cluster not in use - enabling unsafe features anyway
because `akka.remote.use-unsafe-remote-features-outside-cluster` has been
enabled.
2024-01-12 11:02:15,938 INFO akka.remote.Remoting
[] - Starting remoting
2024-01-12 11:02:16,019 INFO akka.remote.Remoting
[] - Remoting started; listening on addresses
:[akka.tcp://flink-metrics@100.64.3.40:43773]
2024-01-12 11:02:16,037 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils [] - Actor system
started at akka.tcp://flink-metrics@100.64.3.40:43773
2024-01-12 11:02:16,118 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
[] - Starting RPC endpoint for
org.apache.flink.runtime.metrics.dump.MetricQueryService at
akka://flink-metrics/user/rpc/MetricQueryService_basic-example-taskmanager-1-3 .
2024-01-12 11:02:16,141 INFO org.apache.flink.runtime.blob.PermanentBlobCache
[] - Created BLOB cache storage directory
/tmp/tm_basic-example-taskmanager-1-3/blobStorage
2024-01-12 11:02:16,148 INFO org.apache.flink.runtime.blob.TransientBlobCache
[] - Created BLOB cache storage directory
/tmp/tm_basic-example-taskmanager-1-3/blobStorage
2024-01-12 11:02:16,216 INFO
org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled
external resources: []
2024-01-12 11:02:16,218 INFO
org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] -
Loading delegation token receivers
2024-01-12 11:02:16,224 INFO
org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] -
Delegation token receiver hadoopfs loaded and initialized
2024-01-12 11:02:16,225 INFO
org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] -
Delegation token receiver hbase loaded and initialized
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-datadog
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-statsd
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-slf4j
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-graphite
2024-01-12 11:02:16,226 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-prometheus
2024-01-12 11:02:16,227 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: external-resource-gpu
2024-01-12 11:02:16,227 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-influx
2024-01-12 11:02:16,227 INFO org.apache.flink.core.plugin.DefaultPluginManager
[] - Plugin loader with ID found, reusing it: metrics-jmx
2024-01-12 11:02:16,228 INFO
org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] -
Delegation token receivers loaded successfully
2024-01-12 11:02:16,228 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Starting
TaskManager with ResourceID: basic-example-taskmanager-1-3
2024-01-12 11:02:16,254 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerServices [] - Temporary
file directory '/tmp': total 94 GB, usable 88 GB (93.62% usable)
2024-01-12 11:02:16,258 INFO
org.apache.flink.runtime.io.disk.iomanager.IOManager [] - Created a new
FileChannelManager for spilling of task related data to disk (joins, sorting,
...). Used directories:
/tmp/flink-io-d2303d34-47ac-4a6f-a1dd-bcb08211d531
2024-01-12 11:02:16,312 INFO
org.apache.flink.runtime.io.network.netty.NettyConfig [] - NettyConfig
[server address: /0.0.0.0, server port: 0, ssl enabled: false, memory segment
size (bytes): 32768, transport type: AUTO, number of server threads: 2
(manual), number of client threads: 2 (manual), server connect backlog: 0 (use
Netty's default), client connect timeout (sec): 120, send/receive buffer size
(bytes): 0 (use Netty's default)]
2024-01-12 11:02:16,438 INFO
org.apache.flink.runtime.io.network.NettyShuffleServiceFactory [] - Created a
new FileChannelManager for storing result partitions of BLOCKING shuffles. Used
directories:
/tmp/flink-netty-shuffle-b698edd9-5b87-4f17-9442-2190641af033
2024-01-12 11:02:16,820 INFO
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool [] - Allocated 158
MB for network buffer pool (number of memory segments: 5079, bytes per segment:
32768).
2024-01-12 11:02:16,842 INFO
org.apache.flink.runtime.io.network.NettyShuffleEnvironment [] - Starting the
network environment and its components.
2024-01-12 11:02:17,029 INFO
org.apache.flink.runtime.io.network.netty.NettyClient [] - Transport
type 'auto': using EPOLL.
2024-01-12 11:02:17,031 INFO
org.apache.flink.runtime.io.network.netty.NettyClient [] - Successful
initialization (took 188 ms).
2024-01-12 11:02:17,039 INFO
org.apache.flink.runtime.io.network.netty.NettyServer [] - Transport
type 'auto': using EPOLL.
2024-01-12 11:02:17,141 INFO
org.apache.flink.runtime.io.network.netty.NettyServer [] - Successful
initialization (took 108 ms). Listening on SocketAddress /0.0.0.0:42335.
2024-01-12 11:02:17,143 INFO
org.apache.flink.runtime.taskexecutor.KvStateService [] - Starting the
kvState service and its components.
2024-01-12 11:02:17,236 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService
[] - Starting RPC endpoint for
org.apache.flink.runtime.taskexecutor.TaskExecutor at
akka://flink/user/rpc/taskmanager_0 .
2024-01-12 11:02:17,342 INFO
org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Start job
leader service.
2024-01-12 11:02:17,345 INFO org.apache.flink.runtime.filecache.FileCache
[] - User file cache uses directory
/tmp/flink-dist-cache-3b3d1cb3-3914-4dd5-a403-216680f25c79
2024-01-12 11:02:17,349 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to
ResourceManager
akka.tcp://flink@basic-example.default:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2024-01-12 11:02:27,441 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not
resolve ResourceManager address
akka.tcp://flink@basic-example.default:6123/user/rpc/resourcemanager_*,
retrying in 10000 ms: Could not connect to rpc endpoint under address
akka.tcp://flink@basic-example.default:6123/user/rpc/resourcemanager_*.
2024-01-12 11:02:37,538 INFO akka.remote.transport.ProtocolStateActor
[] - No response from remote for outbound association. Associate
timed out after [20000 ms].
2024-01-12 11:02:37,546 WARN akka.remote.ReliableDeliverySupervisor
[] - Association with remote system
[akka.tcp://flink@basic-example.default:6123] has failed, address is now gated
for [50] ms. Reason: [Association failed with
[akka.tcp://flink@basic-example.default:6123]] Caused by: [No response from
remote for outbound association. Associate timed out after [20000 ms].]
(…)
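The repeated ConnectionUtils entries in the TaskManager log can be condensed with a small script. The sketch below is a hypothetical helper (summarize_failures is not a Flink tool). It assumes the log was saved as plain text and that the mail client wrapped lines only between the bracketed fields, as in the excerpt above.

```python
import re
from collections import Counter

# Matches ConnectionUtils failure entries; \s+ lets the pattern span
# the line breaks introduced between the bracketed fields.
FAILED = re.compile(
    r"Failed to connect to\s+\[(?P<target>[^\]]+)\]\s+"
    r"from local address\s+\[(?P<local>[^\]]+)\]\s+"
    r"with timeout\s+\[(?P<timeout>\d+)\]\s+"
    r"due to:\s+(?P<reason>[^\n]+)"
)


def summarize_failures(log_text):
    """Count failed connection attempts per (target, reason) pair."""
    counts = Counter()
    for match in FAILED.finditer(log_text):
        counts[(match["target"], match["reason"].strip())] += 1
    return counts
```

Applied to the excerpt above, every matching entry has the same target, basic-example.default/100.64.3.37:6123, and the same reason, connect timed out.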