[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled

2019-12-20 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000718#comment-17000718
 ] 

Zhu Zhu commented on FLINK-15320:
-

The PR is opened. [~trohrmann]
I checked the code and found the only possible entry points which cancel 
vertices out of {{DefaultScheduler#restartTasksWithDelay()}} (the method would 
increment vertex versions) are 3 methods in SchedulerBase: {{#failJob()}}, 
{{#cancel()}} and {{#suspend()}}.
So I added `SchedulerBase#incrementVersionsOfAllVertices()` and use it in those 
3 cases above. In this way, outdated deployments can always be identified by 
checking the vertex versions and this unexpected issue would not happen.

> JobManager crashes in the standalone model when cancelling job which subtask' 
> status is scheduled
> -
>
> Key: FLINK-15320
> URL: https://issues.apache.org/jira/browse/FLINK-15320
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: lining
>Assignee: Zhu Zhu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Use start-cluster.sh to start a standalone cluster, and then submit a job 
> from the streaming's example which name is TopSpeedWindowing, parallelism is 
> 20. Wait for one minute, cancel the job, jobmanager will crash. The exception 
> stack is:
> 2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> Could not assign resource 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to 
> current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, 
> TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print 
> to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387)
>  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372)
>  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) 
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716)
>  at 
> 

[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled

2019-12-19 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999898#comment-16999898
 ] 

Zhu Zhu commented on FLINK-15320:
-

Thanks [~trohrmann]. And also thanks [~lining] for reporting this issue.

> JobManager crashes in the standalone model when cancelling job which subtask' 
> status is scheduled
> -
>
> Key: FLINK-15320
> URL: https://issues.apache.org/jira/browse/FLINK-15320
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: lining
>Assignee: Zhu Zhu
>Priority: Blocker
> Fix For: 1.10.0
>
>
> Use start-cluster.sh to start a standalone cluster, and then submit a job 
> from the streaming's example which name is TopSpeedWindowing, parallelism is 
> 20. Wait for one minute, cancel the job, jobmanager will crash. The exception 
> stack is:
> 2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> Could not assign resource 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to 
> current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, 
> TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print 
> to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387)
>  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372)
>  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) 
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.releaseSharedSlot(SchedulerImpl.java:552)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.cancelSlotRequest(SchedulerImpl.java:184)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.returnLogicalSlot(SchedulerImpl.java:195)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:181)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> 

[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled

2019-12-19 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999875#comment-16999875
 ] 

Till Rohrmann commented on FLINK-15320:
---

This is a very good find and is indeed a blocker. Gary is already on vacation. 
I try to find some time to look into your proposed solution.

> JobManager crashes in the standalone model when cancelling job which subtask' 
> status is scheduled
> -
>
> Key: FLINK-15320
> URL: https://issues.apache.org/jira/browse/FLINK-15320
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: lining
>Assignee: Zhu Zhu
>Priority: Blocker
> Fix For: 1.10.0
>
>
> Use start-cluster.sh to start a standalone cluster, and then submit a job 
> from the streaming's example which name is TopSpeedWindowing, parallelism is 
> 20. Wait for one minute, cancel the job, jobmanager will crash. The exception 
> stack is:
> 2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> Could not assign resource 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to 
> current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, 
> TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print 
> to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387)
>  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372)
>  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) 
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.releaseSharedSlot(SchedulerImpl.java:552)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.cancelSlotRequest(SchedulerImpl.java:184)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.returnLogicalSlot(SchedulerImpl.java:195)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:181)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> 

[jira] [Commented] (FLINK-15320) JobManager crashes in the standalone model when cancelling job which subtask' status is scheduled

2019-12-18 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999815#comment-16999815
 ] 

Zhu Zhu commented on FLINK-15320:
-

To summarize, this issue happens when an external job cancel request comes when 
the job is still allocating slots.
The root cause is that, the version of the canceled vertices are not 
incremented in the case of external job cancel request, and the pending slot 
requests are also not canceled in this case, so that the returned slot can be 
used to fulfill an outdated deployment, which finally triggers the fatal error.
To fix it, I think we should always increment the version of a vertex before 
canceling/failing it. Besides {{JobMaster#cancel}}, there are several other 
cases including {{JobMaster#suspend}} and {{ExecutionGraph#failJob}}.

cc [~gjy]

> JobManager crashes in the standalone model when cancelling job which subtask' 
> status is scheduled
> -
>
> Key: FLINK-15320
> URL: https://issues.apache.org/jira/browse/FLINK-15320
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: lining
>Assignee: Zhu Zhu
>Priority: Blocker
> Fix For: 1.10.0
>
>
> Use start-cluster.sh to start a standalone cluster, and then submit a job 
> from the streaming's example which name is TopSpeedWindowing, parallelism is 
> 20. Wait for one minute, cancel the job, jobmanager will crash. The exception 
> stack is:
> 2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...2019-12-19 10:12:11,060 ERROR 
> org.apache.flink.runtime.util.FatalExitExceptionHandler       - FATAL: Thread 
> 'flink-akka.actor.default-dispatcher-2' produced an uncaught exception. 
> Stopping the process...java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> Could not assign resource 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@583585c6 to 
> current execution Attempt #0 (Window(GlobalWindows(), DeltaTrigger, 
> TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print 
> to Std. Out (19/20)) @ (unassigned) - [CANCELED]. at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.propagateIfNonNull(DefaultScheduler.java:387)
>  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$deployAll$4(DefaultScheduler.java:372)
>  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) 
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:705)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:170)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFulfillSlotRequestOrMakeAvailable(SlotPoolImpl.java:534)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSingleSlot(SlotPoolImpl.java:479)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseSlot(SlotPoolImpl.java:390)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:557)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.releaseChild(SlotSharingManager.java:607)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.access$700(SlotSharingManager.java:352)
>  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:716)
>  at 
>