[ https://issues.apache.org/jira/browse/FLINK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Zhu closed FLINK-23806.
---------------------------
    Resolution: Fixed

Fixed via
master/release-1.14: f543e9a97e2d2dda340d4d1d54467ffe060666cb
release-1.13: de16f34193799e7f3aade15b9bc57549f8010621
release-1.12: 5e83f3e6f3d9bef893a28e68b6ed2534589f1e30

> StackOverflowException can happen if a large scale job failed to acquire
> enough slots in time
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-23806
>                 URL: https://issues.apache.org/jira/browse/FLINK-23806
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.5, 1.13.2
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.14.0, 1.12.6, 1.13.3
>
>
> When requested slots are not fulfilled in time, a task failure is
> triggered and all related tasks are canceled and restarted. However, during
> this process, if a task has already been assigned a slot, that slot is
> returned to the slot pool and immediately used to fulfill pending slot
> requests of tasks that are about to be canceled. The execution versions of
> those tasks have already been bumped in
> {{DefaultScheduler#restartTasksWithDelay(...)}}, so the assignment fails
> immediately, and the slot is returned to the slot pool and again used to
> fulfill pending slot requests. As the sample call stack shows, each of
> these steps is nested inside the previous one, so a StackOverflow can
> happen when there are many vertices, and the resulting fatal error can
> crash the JM. A sample call stack is attached below.
> To fix the problem, one way is to cancel the pending requests of all the
> tasks that will soon be canceled (i.e. tasks whose version has been bumped)
> before canceling these tasks.
> {panel}
> ...
> 	at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.releaseSharedSlot(SlotSharingExecutionSlotAllocator.java:242) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.scheduler.SharedSlot.releaseExternally(SharedSlot.java:281) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.scheduler.SharedSlot.removeLogicalSlotRequest(SharedSlot.java:242) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.scheduler.SharedSlot.returnLogicalSlot(SharedSlot.java:234) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:203) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705) ~[?:1.8.0_102]
> 	at java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717) ~[?:1.8.0_102]
> 	at java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010) ~[?:1.8.0_102]
> 	at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.returnSlotToOwner(SingleLogicalSlot.java:200) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.releaseSlot(SingleLogicalSlot.java:130) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.scheduler.DefaultScheduler.releaseSlotIfPresent(DefaultScheduler.java:542) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$8(DefaultScheduler.java:505) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_102]
> 	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_102]
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_102]
> 	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_102]
> 	at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequest.fulfill(DeclarativeSlotPoolBridge.java:552) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequestSlotMatching.fulfillPendingRequest(DeclarativeSlotPoolBridge.java:587) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.newSlotsAreAvailable(DeclarativeSlotPoolBridge.java:171) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.lambda$freeReservedSlot$0(DefaultDeclarativeSlotPool.java:316) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_102]
> 	at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeReservedSlot(DefaultDeclarativeSlotPool.java:313) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.releaseSlot(DeclarativeSlotPoolBridge.java:335) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> 	at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112) ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> ...
> {panel}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
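
The re-entrant hand-off described in the issue can be illustrated with a toy model. The sketch below is not Flink code: `ToySlotPool`, `maxReleaseDepth`, and every other name in it are hypothetical stand-ins. It only assumes what the description states, namely that a returned slot is handed to the next pending request on the same call stack and that a request from a version-bumped task fails and returns the slot at once. It measures the nesting depth of the release call instead of actually overflowing the stack, with and without canceling the pending requests first.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/** Toy model of the re-entrant slot hand-off; all names are hypothetical, not Flink APIs. */
public class SlotRecursionSketch {

    /** Stand-in for a slot pool that fulfills pending requests synchronously. */
    static final class ToySlotPool {
        private final Deque<Consumer<ToySlotPool>> pending = new ArrayDeque<>();
        private int depth;   // current nesting of releaseSlot calls
        int maxDepth;        // deepest nesting observed so far
        int freeSlots;       // slots that ended up idle in the pool

        void request(Consumer<ToySlotPool> onFulfilled) {
            pending.add(onFulfilled);
        }

        void cancelAllPending() {
            pending.clear();
        }

        /** A returned slot is handed to the next pending request on the same call stack. */
        void releaseSlot() {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            Consumer<ToySlotPool> next = pending.poll();
            if (next == null) {
                freeSlots++;        // nobody is waiting, the slot stays in the pool
            } else {
                next.accept(this);  // fulfill the request before this call returns
            }
            depth--;
        }
    }

    /** Deepest releaseSlot nesting when one slot is released with stale requests pending. */
    public static int maxReleaseDepth(int staleRequests, boolean cancelPendingFirst) {
        ToySlotPool pool = new ToySlotPool();
        for (int i = 0; i < staleRequests; i++) {
            // The requesting task's version was already bumped, so the assignment
            // fails and the slot goes straight back to the pool.
            pool.request(ToySlotPool::releaseSlot);
        }
        if (cancelPendingFirst) {
            pool.cancelAllPending(); // the fix idea: drop stale requests before releasing
        }
        pool.releaseSlot();
        return pool.maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(maxReleaseDepth(500, false)); // prints 501
        System.out.println(maxReleaseDepth(500, true));  // prints 1
    }
}
```

In this model the nesting depth grows linearly with the number of stale pending requests, which matches the repeating frames in the sample call stack. Canceling the requests up front keeps the depth constant.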