[ https://issues.apache.org/jira/browse/FLINK-20672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251694#comment-17251694 ]
Yun Tang commented on FLINK-20672: ---------------------------------- I think you're right and this might happen in extra case. I think the solution might be catching any exception when sending these abort messages just like current disposing/discarding checkpoint in the executor. > CheckpointAborted RPC failure can fail JM > ----------------------------------------- > > Key: FLINK-20672 > URL: https://issues.apache.org/jira/browse/FLINK-20672 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing > Affects Versions: 1.12.0, 1.11.3 > Reporter: Roman Khachatryan > Priority: Major > > Introduced in FLINK-8871, aborted RPC notifications are done asynchonously: > > {code} > private void sendAbortedMessages(long checkpointId, long timeStamp) { > // send notification of aborted checkpoints asynchronously. > executor.execute(() -> { > // send the "abort checkpoint" messages to necessary > vertices. > // .. > }); > } > {code} > However, the executor that eventually executes this request is created as > follows > {code} > final ScheduledExecutorService futureExecutor = > Executors.newScheduledThreadPool( > Hardware.getNumberCPUCores(), > new ExecutorThreadFactory("jobmanager-future")); > {code} > ExecutorThreadFactory uses UncaughtExceptionHandler that exits JVM on error. > cc: [~yunta] -- This message was sent by Atlassian Jira (v8.3.4#803005)