Hi Yang Wang, first, here are the versions in my environment:

Kubernetes: 1.17.4
CNI: Weave

Items 1-3 are my questions; item 4 is the JM (JobManager) log.

1. After removing taskmanager-query-state-service.yaml, it indeed does not work. nslookup result:

kubectl exec -it busybox2 -- /bin/sh
/ # nslookup 10.47.96.2
Server: 10.96.0.10
Address: 10.96.0.10:53

** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN

2. Flink 1.11 vs. Flink 1.10: in the job's detail view (subtasks / taskmanagers), the "xxx x" row shows 172-20-0-50 in 1.11, whereas in 1.10 it shows flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed here? (This cluster currently runs both 1.10 and 1.11, and 1.10 runs normally. If CoreDNS had a problem, shouldn't Flink 1.10 show the same behavior?)

3. Does CoreDNS need any special configuration? Resolving domain names inside the containers works fine; the problem only appears for reverse lookups when there is no Service. Is there anything in CoreDNS that needs to be configured?

4. The JM log at the time of the timeout is as follows:

2020-07-23 13:53:00,228 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0 was granted leadership with fencing token 00000000000000000000000000000000
2020-07-23 13:53:00,232 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_1 .
2020-07-23 13:53:00,233 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2020-07-23 13:53:03,472 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696 (akka.tcp://flink@10.34.128.9:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:03,777 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4 (akka.tcp://flink@10.32.160.6:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:03,787 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4 (akka.tcp://flink@10.39.0.8:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:04,044 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2 (akka.tcp://flink@10.38.64.7:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:04,099 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330 (akka.tcp://flink@10.42.160.6:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:04,146 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8 (akka.tcp://flink@10.33.192.14:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:55:44,220 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,222 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,251 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/rpc/jobmanager_2 .
2020-07-23 13:55:44,260 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Initializing job JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,278 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy NoRestartBackoffTimeStrategy for JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Running initialization on master for job JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Successfully ran initialization on master in 0 ms.
2020-07-23 13:55:44,428 INFO org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] - Built 1 pipelined regions in 25 ms
2020-07-23 13:55:44,437 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Loading state backend via factory org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
2020-07-23 13:55:44,456 INFO org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using predefined options: DEFAULT.
2020-07-23 13:55:44,457 INFO org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using default options factory: DefaultConfigurableOptionsFactory{configuredOptions={}}.
2020-07-23 13:55:44,466 WARN org.apache.flink.runtime.util.HadoopUtils [] - Could not find Hadoop configuration via any of the supported methods (Flink configuration, environment variables).
2020-07-23 13:55:45,276 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533 for JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:45,280 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
2020-07-23 13:55:45,286 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Starting scheduling with scheduling strategy [org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]
2020-07-23 13:55:45,436 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
2020-07-23 13:55:45,436 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
2020-07-23 13:55:45,436 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
2020-07-23 13:55:45,437 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e559485ea7b0b7e17367816882538d90}]
2020-07-23 13:55:45,437 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
2020-07-23 13:55:45,437 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{582a86197884206652dff3aea2306bb3}]
2020-07-23 13:55:45,437 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
2020-07-23 13:55:45,437 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
2020-07-23 13:55:45,438 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{c35033d598a517acc108424bb9f809fb}]
2020-07-23 13:55:45,438 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
2020-07-23 13:55:45,438 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
2020-07-23 13:55:45,487 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Connecting to ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 13:55:45,492 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Resolved ResourceManager address, beginning registration
2020-07-23 13:55:45,493 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,499 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,501 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-07-23 13:55:45,501 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,502 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id d420d08bf2654d9ea76955c70db18b69.
2020-07-23 13:55:45,502 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,514 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id fce526bbe3e1be91caa3e4b536b20e35.
2020-07-23 13:55:45,514 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,514 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,515 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,517 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id 18ac7ec802ebfcfed8c05ee9324a55a4.
2020-07-23 13:55:45,518 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id 7ec76cbe689eb418b63599e90ade19be.
2020-07-23 13:55:45,518 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id b78837a29b4032924ac25be70ed21a3c.
2020-07-23 13:58:18,037 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.47.96.2, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:22,192 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.34.64.14, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:22,358 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.34.128.9, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:24,562 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.32.160.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:25,487 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.38.64.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:27,636 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.42.160.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:27,767 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.43.64.12, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:29,651 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed out.
2020-07-23 13:58:29,651 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.
2020-07-23 13:58:29,854 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.39.0.8, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:33,623 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.35.0.10, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:35,756 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.36.32.8, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:36,694 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.42.128.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 14:01:17,814 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed out..
2020-07-23 14:01:17,815 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Connecting to ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 14:01:17,816 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Resolved ResourceManager address, beginning registration
2020-07-23 14:01:17,816 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:17,836 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: host_relation -> Timestamps/Watermarks -> Map (1/1) (302ca9640e2d209a543d843f2996ccd2) switched from SCHEDULED to FAILED on not deployed.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
    ... 25 more
Caused by: java.util.concurrent.TimeoutException
    ... 23 more
2020-07-23 14:01:17,848 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,910 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 902 tasks should be restarted to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,913 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state RUNNING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    ... 45 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
    ... 25 more
Caused by: java.util.concurrent.TimeoutException
    ... 23 more
2020-07-23 14:01:18,109 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 1809eb912d69854f2babedeaf879df6a.
2020-07-23 14:01:18,110 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
    ... 45 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
    ... 25 more
Caused by: java.util.concurrent.TimeoutException
    ... 23 more
2020-07-23 14:01:18,114 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Stopping checkpoint coordinator for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,117 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2020-07-23 14:01:18,118 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 302ca9640e2d209a543d843f2996ccd2.
2020-07-23 14:01:18,120 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] timed out.
2020-07-23 14:01:18,120 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] timed out.
2020-07-23 14:01:18,120 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [SlotRequestId{e7e422409acebdb385014a9634af6a90}] timed out.
2020-07-23 14:01:18,121 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] timed out.
2020-07-23 14:01:18,121 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] timed out.
2020-07-23 14:01:18,121 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] timed out.
2020-07-23 14:01:18,122 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] timed out.
2020-07-23 14:01:18,122 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Pending slot request [
2020-07-23 14:01:18,151 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 99a030d0e3f428490a501c0132f27a56 reached globally terminal state FAILED.
2020-07-23 14:01:18,162 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,162 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
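2020-07-23 14:01:18,225 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job JobTest(99a030d0e3f428490a501c0132f27a56).
2020-07-23 14:01:18,381 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Suspending SlotPool.
2020-07-23 14:01:18,382 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: JobManager is shutting down..
2020-07-23 14:01:18,382 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Stopping SlotPool.
2020-07-23 14:01:18,382 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.

a511955993
Email: a511955...@163.com
Signature customized by NetEase Mail Master

On 07/23/2020 13:26, Yang Wang wrote:
I'm glad your problem is solved, but I don't think the root cause is that you added taskmanager-query-state-service.yaml.
Everything works fine on my side without creating this service, and nslookup {tm_ip_address} correctly reverse-resolves to the hostname.
Note that the check here is not resolving a hostname; it is a reverse lookup from the IP address.

To answer your two questions:
1. It is not required. In my tests it did not need to be created, and the cluster ran jobs normally. The REST service works whether it is exposed as ClusterIP, NodePort, or LoadBalancer.
2. If taskmanager.bind-host is not configured, the two JIRAs [Flink-15911][Flink-15154] do not affect the address the TM uses when registering with the RM.

If you want to find the root cause, you will probably need to provide the complete JM/TM logs so they can be analyzed.

Best,
Yang

SmileSmile <a511955...@163.com> wrote on Thu, Jul 23, 2020 at 11:30 AM:
>
> Hi Yang Wang
>
> I just tested in the test environment: the TaskManager cannot be resolved with nslookup, while the JM can. The difference between the two is whether a Service exists for them.
>
> Solution: I added taskmanager-query-state-service.yaml to the cluster (an optional service according to the official docs [1]). After that, "No hostname could be resolved for ip address" was no longer logged repeatedly, and after changing NodePort to ClusterIP the job was submitted successfully without any timeouts. The problem is solved.
>
>
> 1. Given the above, is this configuration file mandatory?
>
> 2. Among the 1.11 changes I noticed [Flink-15911][Flink-15154], which allow separately configuring the network interface bound for local listening and the address and port used for external access. Is it because of these changes that the JM now needs to reverse-resolve a Service from the IP reported by the TM?
>
>
> Best!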
>
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
>
> On 07/23/2020 10:11, Yang Wang <danrtsey...@gmail.com> wrote:
> What I mean is: while the Flink job is running, use the command below to start a busybox pod in the cluster,
> then run nslookup {ip_address} inside it and check whether the address resolves. If it does not, the problem
> is most likely CoreDNS.
>
> kubectl run -i -t busybox --image=busybox --restart=Never
>
> Check whether the cluster's CoreDNS pods are healthy; they are usually deployed in the kube-system namespace.
>
>
>
> Best,
> Yang
>
>
> SmileSmile <a511955...@163.com> wrote on Wed, Jul 22, 2020 at 7:57 PM:
>
> >
> > Hi, Yang Wang!
> >
> > I'm very glad to get your reply; it helped a lot and showed me where to look. Here is some more information, which I hope will help narrow down the root cause.
> >
> > Where the JM reports "No hostname could be resolved for ip address xxxxx", the IP in the message is the internal IP that k8s assigned to the Flink pod, not the IP of the host machine. Where might this problem come from?
> >
> > Best!
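Yang's check runs `nslookup` against a pod IP, which is a reverse (PTR) lookup rather than a hostname lookup. As the NXDOMAIN reply at the top of this thread shows (`2.96.47.10.in-addr.arpa`), the resolver reverses the IPv4 octets and appends `in-addr.arpa`. The sketch below builds that query name so it is clear which record the cluster DNS must be able to answer; `ReverseDnsName` and `toPtrName` are hypothetical names, not part of Flink or Kubernetes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Flink or Kubernetes): builds the PTR query
// name that a reverse lookup such as `nslookup <ip>` asks the DNS server to answer.
final class ReverseDnsName {

    static String toPtrName(String ipv4) {
        // limit -1 keeps trailing empty strings, so "1.2.3.4." is rejected
        String[] octets = ipv4.trim().split("\\.", -1);
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not a dotted IPv4 address: " + ipv4);
        }
        List<String> parts = new ArrayList<>();
        for (String octet : octets) {
            // NumberFormatException (an IllegalArgumentException) on non-numeric input
            int value = Integer.parseInt(octet);
            if (value < 0 || value > 255) {
                throw new IllegalArgumentException("Octet out of range: " + octet);
            }
            parts.add(Integer.toString(value));
        }
        // Reverse lookups query the octets in reverse order under in-addr.arpa.
        Collections.reverse(parts);
        return String.join(".", parts) + ".in-addr.arpa";
    }

    public static void main(String[] args) {
        // IP used in the nslookup test at the top of this thread.
        System.out.println(toPtrName("10.47.96.2"));
    }
}
```

Running it prints `2.96.47.10.in-addr.arpa`, the same name as in the NXDOMAIN answer; NXDOMAIN for that name means the DNS server returned no PTR record for the pod IP.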
> >
> > On 07/22/2020 18:18, Yang Wang <danrtsey...@gmail.com> wrote:
> > If your logs keep printing "No hostname could be resolved for the IP address", the cluster's CoreDNS is
> > probably at fault: the hostname cannot be found by a reverse lookup of the IP address. You can start a
> > busybox pod to verify whether this particular IP fails to resolve; CoreDNS may be the problem.
> >
> >
> > Best,
> > Yang
> >
> > Congxian Qiu <qcx978132...@gmail.com> wrote on Tue, Jul 21, 2020 at 7:29 PM:
> >
> > > Hi
> > > I'm not sure whether you can see a pod's complete logs in a k8s environment, similar to the NM logs on YARN. If you can, check this pod's complete logs for clues.
> > > Best,
> > > Congxian
> > >
> > >
> > > SmileSmile <a511955...@163.com> wrote on Tue, Jul 21, 2020 at 3:19 PM:
> > >
> > > > Hi, Congxian
> > > >
> > > > Since this is a test environment, HA is not configured. What I can see so far is that the JM floods the log with "no hostname could be resolved", the JM becomes unreachable, and job submission fails.
> > > > Setting the JM memory to 10g gives the same result (jobmanager.memory.process.size: 10240m).
> > > >
> > > > After rolling back to 1.10 in the same environment, the problem does not occur and the errors above are not logged.
> > > >
> > > > Are there any other ways to troubleshoot this?
> > > >
> > > > Best!
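The warning Yang describes comes from a reverse lookup that finds no hostname for a TaskManager's IP. Below is a minimal sketch of such a check with the JDK's `InetAddress`, assuming the fallback behaves as the log message suggests; it is not Flink's `TaskManagerLocation` code. In practice `getCanonicalHostName()` (documented as best effort) returns the textual IP when the reverse lookup fails, so a result equal to the IP literal signals the fallback. `HostnameCheck` and its methods are hypothetical names.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration of a reverse-lookup check with an IP fallback.
final class HostnameCheck {

    // Pure decision logic, kept apart from the network call so it is easy to test.
    // Returns the name to use: the reverse-lookup result, or the IP literal when the
    // lookup produced nothing better than the IP itself.
    static String chooseHostName(String ipLiteral, String canonicalName) {
        if (canonicalName == null || canonicalName.isEmpty() || canonicalName.equals(ipLiteral)) {
            return ipLiteral;
        }
        return canonicalName;
    }

    static boolean reverseLookupFailed(String ipLiteral, String canonicalName) {
        return chooseHostName(ipLiteral, canonicalName).equals(ipLiteral);
    }

    public static void main(String[] args) throws UnknownHostException {
        for (String ip : args) {
            // For an IP literal, getByName only validates the format; no DNS query yet.
            InetAddress address = InetAddress.getByName(ip);
            // This call performs the reverse (PTR) lookup through the configured resolver.
            String canonical = address.getCanonicalHostName();
            if (reverseLookupFailed(address.getHostAddress(), canonical)) {
                System.out.println("No hostname could be resolved for " + ip + ", using the IP instead");
            } else {
                System.out.println(ip + " -> " + canonical);
            }
        }
    }
}
```

Passing the IPs from the WARN lines (for example `10.32.160.7`) as arguments shows whether the reverse lookup succeeds from wherever the program runs; running it inside a pod exercises the cluster DNS, as the busybox test does.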
> > > >
> > > > On 07/16/2020 13:17, Congxian Qiu wrote:
> > > > Hi
> > > > If there are no exceptions and GC looks normal, you could check the pod's logs, and if HA is enabled, the ZooKeeper logs as well. I once saw a similar symptom in a YARN environment that turned out to have a different cause, which I found by looking at the NM logs and the ZooKeeper logs.
> > > >
> > > > Best,
> > > > Congxian
> > > >
> > > >
> > > > SmileSmile <a511955...@163.com> wrote on Wed, Jul 15, 2020 at 5:20 PM:
> > > >
> > > > > Hi Roc
> > > > >
> > > > > This does not happen in 1.10.1; it only appears in 1.11. What would be a good way to investigate it?
> > > > >
> > > > > On 07/15/2020 17:16, Roc Marshal wrote:
> > > > > Hi, SmileSmile.
> > > > > I have run into a similar host-resolution problem before. You could investigate it from the angle of the k8s pod/node network mapping.
> > > > > I hope this helps.
> > > > >
> > > > > Best regards,
> > > > > Roc Marshal
> > > > >
> > > > > At 2020-07-15 17:04:18, "SmileSmile" <a511955...@163.com> wrote:
> > > > > >
> > > > > >Hi
> > > > > >
> > > > > >I'm using Flink 1.11, deployed as a Kubernetes session cluster, with 30 TMs and 4 slots per TM; the job parallelism is 120. When the job is submitted, a large number of "No hostname could be resolved for the IP address" warnings appear, the JM times out, and submission fails. The web UI also hangs and stops responding.
> > > > > >
> > > > > >Even WordCount with a parallelism of 1 prints a few "no hostname" log lines before it is submitted normally; once the parallelism goes up, submission times out.
> > > > > >
> > > > > >Part of the log:
> > > > > >
> > > > > >2020-07-15 16:58:46,460 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.32.160.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
> > > > > >2020-07-15 16:58:46,460 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.44.224.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
> > > > > >2020-07-15 16:58:46,461 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation [] - No hostname could be resolved for the IP address 10.40.32.9, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
> > > > > >
> > > > > >2020-07-15 16:59:10,236 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of JobManager with id 69a0d460de468888a9f41c770d963c0a timed out.
> > > > > >2020-07-15 16:59:10,236 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job e1554c737e37ed79688a15c746b6e9ef from the resource manager.
> > > > > >
> > > > > >How should I deal with this?
> > > > > >
> > > > > >Best!
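The original question describes 30 TaskManagers with 4 slots each and a job parallelism of 120. A quick check of that arithmetic, assuming every operator stays in the default slot-sharing group so the job needs as many slots as its highest operator parallelism; `SlotBudget` is a hypothetical helper, not a Flink API.

```java
// Hypothetical back-of-the-envelope check for the setup in the original question.
// Assumes all operators share the default slot-sharing group, so the job needs
// as many slots as its highest operator parallelism.
final class SlotBudget {

    static int totalSlots(int taskManagers, int slotsPerTaskManager) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(taskManagers, slotsPerTaskManager);
    }

    // Slots left over once the job is scheduled; a negative value means it cannot fit.
    static int headroom(int taskManagers, int slotsPerTaskManager, int maxParallelism) {
        return totalSlots(taskManagers, slotsPerTaskManager) - maxParallelism;
    }

    public static void main(String[] args) {
        int spare = headroom(30, 4, 120);
        System.out.println("total slots = " + totalSlots(30, 4) + ", spare = " + spare);
    }
}
```

It prints `total slots = 120, spare = 0`. Under that assumption the cluster has no spare slots, so every TaskManager has to register and offer its slots before the slot request times out; losing even one TaskManager (4 slots) leaves the job short.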