[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled
[ https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000718#comment-17000718 ] Zhu Zhu commented on FLINK-15320: - The PR is opened. [~trohrmann] I checked the code and found the only possible entry points which cancel vertices out of {{DefaultScheduler#restartTasksWithDelay()}} (the method would increment vertex versions) are 3 methods in SchedulerBase: {{#failJob()}}, {{#cancel()}} and {{#suspend()}}. So I added `SchedulerBase#incrementVersionsOfAllVertices()` and use it in those 3 cases above. In this way, outdated deployments can always be identified by checking the vertex versions and this unexpected issue would not happen. > JobManager crashes in the standalone model when cancelling job which subtask' > status is scheduled > - > > Key: FLINK-15320 > URL: https://issues.apache.org/jira/browse/FLINK-15320 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: lining >Assignee: Zhu Zhu >Priority: Blocker > Labels: pull-request-available > Fix For: 1.10.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Use start-cluster.sh to start a standalone cluster, and then submit a job > from the streaming's example which name is TopSpeedWindowing, parallelism is > 20. Wait for one minute, cancel the job, jobmanager will crash. The exception > stack is: > 2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: java.lang.IllegalStateException: > Could not assign resource > org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to > current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, > TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print > to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at > org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > at > org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716) > at >
[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled
[ https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999898#comment-16999898 ] Zhu Zhu commented on FLINK-15320: - Thanks [~trohrmann]. And also thanks [~lining] for reporting this issue. > JobManager crashes in the standalone model when cancelling job which subtask' > status is scheduled > - > > Key: FLINK-15320 > URL: https://issues.apache.org/jira/browse/FLINK-15320 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: lining >Assignee: Zhu Zhu >Priority: Blocker > Fix For: 1.10.0 > > > Use start-cluster.sh to start a standalone cluster, and then submit a job > from the streaming's example which name is TopSpeedWindowing, parallelism is > 20. Wait for one minute, cancel the job, jobmanager will crash. The exception > stack is: > 2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: java.lang.IllegalStateException: > Could not assign resource > org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to > current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, > TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print > to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at > org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > at > org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.releaseSharedSlot(SchedulerImpl.java:552) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.cancelSlotRequest(SchedulerImpl.java:184) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.returnLogicalSlot(SchedulerImpl.java:195) > at > org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:181) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at >
[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled
[ https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999875#comment-16999875 ] Till Rohrmann commented on FLINK-15320: --- This is a very good find and is indeed a blocker. Gary is already on vacation. I try to find some time to look into your proposed solution. > JobManager crashes in the standalone model when cancelling job which subtask' > status is scheduled > - > > Key: FLINK-15320 > URL: https://issues.apache.org/jira/browse/FLINK-15320 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: lining >Assignee: Zhu Zhu >Priority: Blocker > Fix For: 1.10.0 > > > Use start-cluster.sh to start a standalone cluster, and then submit a job > from the streaming's example which name is TopSpeedWindowing, parallelism is > 20. Wait for one minute, cancel the job, jobmanager will crash. The exception > stack is: > 2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: java.lang.IllegalStateException: > Could not assign resource > org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to > current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, > TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print > to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at > org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > at > org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.releaseSharedSlot(SchedulerImpl.java:552) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.cancelSlotRequest(SchedulerImpl.java:184) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.returnLogicalSlot(SchedulerImpl.java:195) > at > org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:181) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) >
[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled
[ https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999815#comment-16999815 ] Zhu Zhu commented on FLINK-15320: - To summarize, this issue happens when an external job cancel request comes when the job is still allocating slots. The root cause is that, the version of the canceled vertices are not incremented in the case of external job cancel request, and the pending slot requests are also not canceled in this case, so that the returned slot can be used to fulfill an outdated deployment, which finally triggers the fatal error. To fix it, I think we should always increment the version of a vertex before canceling/failing it. Besides {{JobMaster#cancel}}, there are several other cases including {{JobMaster#suspend}} and {{ExecutionGraph#failJob}}. cc [~gjy] > JobManager crashes in the standalone model when cancelling job which subtask' > status is scheduled > - > > Key: FLINK-15320 > URL: https://issues.apache.org/jira/browse/FLINK-15320 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.10.0 >Reporter: lining >Assignee: Zhu Zhu >Priority: Blocker > Fix For: 1.10.0 > > > Use start-cluster.sh to start a standalone cluster, and then submit a job > from the streaming's example which name is TopSpeedWindowing, parallelism is > 20. Wait for one minute, cancel the job, jobmanager will crash. The exception > stack is: > 2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...2019-12-19 10:12:11,060 ERROR > org.apache.flink.runtime.util.FatalExitExceptionHandler - FATAL: Thread > 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. > Stopping the process...java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: java.lang.IllegalStateException: > Could not assign resource > org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to > current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, > TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print > to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at > org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387) > at > org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > at > org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716) > at >