[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417542#comment-16417542 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/5775

> Failing allocated slots not noticed
> ---
>
>                 Key: FLINK-9099
>                 URL: https://issues.apache.org/jira/browse/FLINK-9099
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> When allocating slots for eager scheduling, it can happen that allocated
> slots get failed after they are assigned to the {{Execution}} (e.g. due to a
> {{TaskExecutor}} heartbeat timeout). If there are still some uncompleted slot
> futures, then this will not be noticed since the {{Execution}} is assigned to
> the {{LogicalSlot}} only after all slot futures are completed. Therefore, the
> allocated slot failure will go unnoticed until this happens.
> In order to speed up failures, we should directly assign the {{Execution}} to
> the {{LogicalSlot}} once the slot is assigned to the {{Execution}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417250#comment-16417250 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/5775

    Thanks for the review @GJL. Rebasing onto the latest master and merging.
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417249#comment-16417249 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5775#discussion_r177726389

    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
    @@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
             assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
         }

    +    @Nonnull
    +    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
    +        return new TestingLogicalSlot(
    +            new LocalTaskManagerLocation(),
    +            new SimpleAckingTaskManagerGateway(),
    +            0,
    +            new AllocationID(),
    +            new SlotRequestId(),
    +            new SlotSharingGroupId(),
    +            releaseFuture);
    +    }
    +
    +    /**
    +     * Tests that a partially completed eager scheduling operation fails if an
    +     * completed slot is released. See FLINK-9099.
    +     */
    +    @Test
    +    public void testSlotReleasingFailsSchedulingOperation() throws Exception {
    +        final int parallelism = 2;
    +
    +        final JobVertex jobVertex = new JobVertex("Testing job vertex");
    +        jobVertex.setInvokableClass(NoOpInvokable.class);
    +        jobVertex.setParallelism(parallelism);
    +        final JobGraph jobGraph = new JobGraph(jobVertex);
    +        jobGraph.setAllowQueuedScheduling(true);
    +        jobGraph.setScheduleMode(ScheduleMode.EAGER);
    +
    +        final ProgrammedSlotProvider slotProvider = new ProgrammedSlotProvider(2);
    --- End diff --

    Good catch. Will fix it.
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417124#comment-16417124 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5775#discussion_r177699910

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java ---
    @@ -659,7 +660,7 @@ public void disconnectResourceManager() {
                 .orTimeout(pendingRequest.getAllocatedSlotFuture(), allocationTimeout.toMilliseconds(), TimeUnit.MILLISECONDS)
                 .whenCompleteAsync(
                     (AllocatedSlot ignored, Throwable throwable) -> {
    -                    if (throwable != null) {
    +                    if (throwable instanceof TimeoutException) {
    --- End diff --

    This callback is only intended to react to timeouts of the slot allocation future. Since we return the future to the user, any other exception should be visible.
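
The reasoning in the comment above can be illustrated with a minimal, self-contained sketch. It is not the SlotPool code: the class name, the `withTimeoutHandling` helper and the counter are hypothetical, and it uses the plain JDK `whenComplete` instead of `whenCompleteAsync` so that the callback runs deterministically in the completing thread. It only shows that a callback filtering on `TimeoutException` leaves every other failure observable on the future that is handed back to the caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class TimeoutOnlyCallbackSketch {

    /**
     * Hypothetical helper: registers a callback that reacts only to timeouts
     * and returns the same future, so the caller still sees all other failures.
     */
    static <T> CompletableFuture<T> withTimeoutHandling(
            CompletableFuture<T> slotFuture, AtomicInteger timeoutsHandled) {
        slotFuture.whenComplete((ignored, throwable) -> {
            // Only a timeout triggers the (here: counted) cleanup action.
            if (throwable instanceof TimeoutException) {
                timeoutsHandled.incrementAndGet();
            }
        });
        return slotFuture;
    }

    public static void main(String[] args) {
        AtomicInteger timeoutsHandled = new AtomicInteger();

        CompletableFuture<String> timedOut =
            withTimeoutHandling(new CompletableFuture<String>(), timeoutsHandled);
        timedOut.completeExceptionally(new TimeoutException("allocation timed out"));

        CompletableFuture<String> failed =
            withTimeoutHandling(new CompletableFuture<String>(), timeoutsHandled);
        failed.completeExceptionally(new IllegalStateException("slot failed"));

        System.out.println("timeouts handled: " + timeoutsHandled.get());
        System.out.println("other failure visible to caller: " + failed.isCompletedExceptionally());
    }
}
```

In this sketch the non-timeout failure is not swallowed: the callback ignores it, but the returned future is still completed exceptionally, which is where the caller would observe it.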
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416128#comment-16416128 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5775#discussion_r177544954

    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
    @@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
             assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
         }

    +    @Nonnull
    +    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
    --- End diff --

    nit: make static and move below:
    ```
    //
    // Utilities
    //
    ```
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416126#comment-16416126 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5775#discussion_r177544459

    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
    @@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
             assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
         }

    +    @Nonnull
    +    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
    +        return new TestingLogicalSlot(
    +            new LocalTaskManagerLocation(),
    +            new SimpleAckingTaskManagerGateway(),
    +            0,
    +            new AllocationID(),
    +            new SlotRequestId(),
    +            new SlotSharingGroupId(),
    +            releaseFuture);
    +    }
    +
    +    /**
    +     * Tests that a partially completed eager scheduling operation fails if an
    --- End diff --

    nit: *[...] if an completed [...]*
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416124#comment-16416124 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5775#discussion_r177544274

    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
    @@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
             assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
         }

    +    @Nonnull
    +    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
    +        return new TestingLogicalSlot(
    +            new LocalTaskManagerLocation(),
    +            new SimpleAckingTaskManagerGateway(),
    +            0,
    +            new AllocationID(),
    +            new SlotRequestId(),
    +            new SlotSharingGroupId(),
    +            releaseFuture);
    +    }
    +
    +    /**
    +     * Tests that a partially completed eager scheduling operation fails if an
    +     * completed slot is released. See FLINK-9099.
    +     */
    +    @Test
    +    public void testSlotReleasingFailsSchedulingOperation() throws Exception {
    +        final int parallelism = 2;
    +
    +        final JobVertex jobVertex = new JobVertex("Testing job vertex");
    +        jobVertex.setInvokableClass(NoOpInvokable.class);
    +        jobVertex.setParallelism(parallelism);
    +        final JobGraph jobGraph = new JobGraph(jobVertex);
    +        jobGraph.setAllowQueuedScheduling(true);
    +        jobGraph.setScheduleMode(ScheduleMode.EAGER);
    +
    +        final ProgrammedSlotProvider slotProvider = new ProgrammedSlotProvider(2);
    --- End diff --

    Replace `2` with `parallelism`?
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416123#comment-16416123 ]

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5775#discussion_r177543886

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java ---
    @@ -659,7 +660,7 @@ public void disconnectResourceManager() {
                 .orTimeout(pendingRequest.getAllocatedSlotFuture(), allocationTimeout.toMilliseconds(), TimeUnit.MILLISECONDS)
                 .whenCompleteAsync(
                     (AllocatedSlot ignored, Throwable throwable) -> {
    -                    if (throwable != null) {
    +                    if (throwable instanceof TimeoutException) {
    --- End diff --

    Are we not losing some information by swallowing other types of exceptions?
[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed
[ https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415640#comment-16415640 ]

ASF GitHub Bot commented on FLINK-9099:
---

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5775

    [FLINK-9099] Assign Execution to LogicalSlot when slot is assigned to Execution

    ## What is the purpose of the change

    In order to fail fast if an allocated slot is released by the SlotPool, we assign the Execution as payload to a LogicalSlot when the slot is assigned to the Execution.

    cc @GJL

    ## Verifying this change

    - Added `ExecutionGraphSchedulingTest#testSlotReleasingFailsSchedulingOperation`

    ## Does this pull request potentially affect one of the following parts:

      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)

    ## Documentation

      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixSchedulingDeadlock

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5775
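
The purpose stated in the pull request can be illustrated with a toy model. The sketch below does not use Flink classes: `ToyExecution`, `ToySlot` and `failureNoticed` are hypothetical stand-ins, and the only assumption carried over from the description is that a slot notifies its payload when it is released. It contrasts assigning the payload as soon as the slot is assigned with deferring the assignment until all slot futures have completed.

```java
import java.util.ArrayList;
import java.util.List;

public class EagerPayloadSketch {

    /** Hypothetical stand-in for an Execution: it only records failures. */
    static final class ToyExecution {
        final List<String> failures = new ArrayList<>();

        void fail(String cause) {
            failures.add(cause);
        }
    }

    /** Hypothetical stand-in for a slot that notifies its payload on release. */
    static final class ToySlot {
        private ToyExecution payload;
        private boolean released;

        /** Returns false if the slot was already released. */
        boolean tryAssignPayload(ToyExecution execution) {
            if (released) {
                return false;
            }
            payload = execution;
            return true;
        }

        void release(String cause) {
            released = true;
            if (payload != null) {
                payload.fail(cause); // only an assigned payload hears about the release
            }
        }
    }

    /** Releases a slot while other slot futures would still be pending. */
    static boolean failureNoticed(boolean assignPayloadImmediately) {
        ToyExecution execution = new ToyExecution();
        ToySlot slot = new ToySlot();
        if (assignPayloadImmediately) {
            slot.tryAssignPayload(execution);
        }
        // With deferred assignment the payload is still missing at this point.
        slot.release("TaskExecutor heartbeat timeout");
        return !execution.failures.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("deferred assignment notices failure: " + failureNoticed(false));
        System.out.println("immediate assignment notices failure: " + failureNoticed(true));
    }
}
```

In the toy model, only the immediate assignment lets the execution learn about the release, which is the fail-fast behaviour the pull request describes.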