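A production Flink job failed when I manually triggered a savepoint. The job has two operators, one reading from Kafka and one writing to MongoDB, each with a parallelism of 48. After the failure, I saw that 45 of the 48 tasks of the MongoDB writer had completed and the remaining 3 had timed out (the timeout is set to 3 minutes). Normally a checkpoint completes in about 4 seconds, and the state size is only 23.7 KB. After the savepoint failed, all of the job's periodic checkpoints failed as well (3 tasks of each operator timed out). The job uses the filesystem state backend (FsStateBackend), and the DFS is Alibaba Cloud OSS. What could be causing this failure? The job log after the failure is shown below.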


2022-08-29 15:38:32,617 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerStdoutFileHandler [] - Failed to transfer file from TaskExecutor sqrc-session-prod-taskmanager-1-30.
java.util.concurrent.CompletionException: org.apache.flink.util.FlinkException: The file STDOUT does not exist on the TaskExecutor.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$requestFileUploadByFilePath$24(TaskExecutor.java:2064) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_312]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312]
Caused by: org.apache.flink.util.FlinkException: The file STDOUT does not exist on the TaskExecutor.
    ... 5 more
2022-08-29 15:38:32,617 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerStdoutFileHandler [] - Unhandled exception.
org.apache.flink.util.FlinkException: The file STDOUT does not exist on the TaskExecutor.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$requestFileUploadByFilePath$24(TaskExecutor.java:2064) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_312]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312]
