[ https://issues.apache.org/jira/browse/FLINK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-35159:
-----------------------------------
    Labels: pull-request-available  (was: )

> CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash
> ------------------------------------------------------------------------
>
>                 Key: FLINK-35159
>                 URL: https://issues.apache.org/jira/browse/FLINK-35159
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> When a task manager dies while the JM is generating an ExecutionGraph in the
> background, {{CreatingExecutionGraph#handleExecutionGraphCreation}} can
> transition back into WaitingForResources if the TM hosted one of the slots
> that we planned to use in {{tryToAssignSlots}}.
> At this point the ExecutionGraph has already been transitioned to running,
> which implicitly kicks off periodic checkpointing by the
> CheckpointCoordinator, without the operator coordinator holders being
> initialized yet (as this happens after we assigned slots).
> This effectively leaks that CheckpointCoordinator, including the timer thread 
> that will continue to try triggering checkpoints, which will naturally fail 
> to trigger.
> This can cause a JM crash because it causes
> {{OperatorCoordinatorHolder#abortCurrentTriggering}} to be called, which
> fails with an NPE since the {{mainThreadExecutor}} has not been initialized yet.
> {code}
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
>       at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
>       at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>       at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>       at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>       at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
>       at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>       at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
>       at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>       at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>       at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
>       at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>       ... 7 more
> Caused by: java.lang.NullPointerException
>       at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
>       at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
>       at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
>       at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
>       at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
>       at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
>       at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>       ... 8 more
> {code}
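
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not Flink code: {{LeakedTimerSketch}}, its nested {{Holder}} class and the {{firstTimerFailure}} method are hypothetical stand-ins, and the assumption that the abort path dereferences the executor directly is a simplification. It shows a periodic timer that starts firing before the holder's executor has been set, so the first tick hits an NPE, and it shows that the timer keeps running until someone explicitly shuts it down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class LeakedTimerSketch {

    /** Hypothetical stand-in for a holder whose executor is only set late. */
    static final class Holder {
        private Executor mainThreadExecutor; // null until lazyInitialize is called

        void lazyInitialize(Executor executor) {
            this.mainThreadExecutor = executor;
        }

        void abortCurrentTriggering() {
            // Assumption: the abort path uses the executor without a null check.
            mainThreadExecutor.execute(() -> { });
        }
    }

    /**
     * Starts a periodic timer that calls into the holder and returns the
     * first failure the timer thread observed, or null if the call succeeded.
     */
    static Throwable firstTimerFailure(boolean initializeFirst) {
        Holder holder = new Holder();
        if (initializeFirst) {
            holder.lazyInitialize(Runnable::run);
        }
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        CountDownLatch fired = new CountDownLatch(1);
        try {
            timer.scheduleAtFixedRate(() -> {
                try {
                    holder.abortCurrentTriggering();
                } catch (Throwable t) {
                    failure.compareAndSet(null, t);
                } finally {
                    fired.countDown();
                }
            }, 0, 10, TimeUnit.MILLISECONDS);
            fired.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Without this shutdown the timer thread would keep firing,
            // which is the "leak" part of the scenario.
            timer.shutdownNow();
        }
        return failure.get();
    }

    public static void main(String[] args) {
        System.out.println("uninitialized: " + firstTimerFailure(false));
        System.out.println("initialized:   " + firstTimerFailure(true));
    }
}
```

In this sketch the failure disappears once the holder is initialized before the timer starts, or once the timer is shut down when the owning state is abandoned. Which of these the actual fix relies on is not stated in this message.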



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
