[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417542#comment-16417542
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/5775


> Failing allocated slots not noticed
> -----------------------------------
>
>                 Key: FLINK-9099
>                 URL: https://issues.apache.org/jira/browse/FLINK-9099
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> When allocating slots for eager scheduling, an allocated slot can fail after 
> it has been assigned to the {{Execution}} (e.g. due to a {{TaskExecutor}} 
> heartbeat timeout). If some slot futures are still uncompleted, the failure 
> is not noticed, because the {{Execution}} is assigned to the {{LogicalSlot}} 
> only after all slot futures have completed. The allocated slot failure 
> therefore goes unnoticed until then.
> To fail faster, we should assign the {{Execution}} to the {{LogicalSlot}} 
> as soon as the slot is assigned to the {{Execution}}.
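
The difference between late and eager payload assignment described in the issue can be sketched as follows. This is a minimal illustration, not Flink code: `SketchExecution`, `SketchSlot`, and `failuresSeen` are hypothetical stand-ins for the `Execution` / `LogicalSlot` relationship, and the "heartbeat timeout" is simulated by calling `release` directly.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyPayloadAssignment {

    /** Simplified stand-in for an execution that can be failed (hypothetical). */
    static final class SketchExecution {
        final List<String> failures = new ArrayList<>();

        void fail(Throwable cause) {
            failures.add(cause.getMessage());
        }
    }

    /** Simplified stand-in for a logical slot holding at most one payload (hypothetical). */
    static final class SketchSlot {
        private SketchExecution payload;
        private boolean released;

        boolean tryAssignPayload(SketchExecution execution) {
            if (released || payload != null) {
                return false;
            }
            payload = execution;
            return true;
        }

        void release(Throwable cause) {
            released = true;
            // Only a payload that is already attached can be notified.
            if (payload != null) {
                payload.fail(cause);
            }
        }
    }

    /** Returns how many failures the execution observed when its slot is released. */
    static int failuresSeen(boolean assignPayloadEagerly) {
        SketchExecution execution = new SketchExecution();
        SketchSlot slot = new SketchSlot();
        if (assignPayloadEagerly) {
            slot.tryAssignPayload(execution);
        }
        // The slot is released while other slot futures would still be pending.
        slot.release(new RuntimeException("heartbeat timeout"));
        return execution.failures.size();
    }

    public static void main(String[] args) {
        System.out.println("late assignment:  " + failuresSeen(false));
        System.out.println("eager assignment: " + failuresSeen(true));
    }
}
```

With late assignment the release finds no payload to notify, so the execution sees nothing; with eager assignment the same release fails the execution immediately.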



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417250#comment-16417250
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user tillrohrmann commented on the issue:

https://github.com/apache/flink/pull/5775
  
Thanks for the review @GJL. Rebasing onto the latest master and merging.




[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417249#comment-16417249
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/5775#discussion_r177726389
  
--- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
@@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
         assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
     }
 
+    @Nonnull
+    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
+        return new TestingLogicalSlot(
+            new LocalTaskManagerLocation(),
+            new SimpleAckingTaskManagerGateway(),
+            0,
+            new AllocationID(),
+            new SlotRequestId(),
+            new SlotSharingGroupId(),
+            releaseFuture);
+    }
+
+    /**
+     * Tests that a partially completed eager scheduling operation fails if an
+     * completed slot is released. See FLINK-9099.
+     */
+    @Test
+    public void testSlotReleasingFailsSchedulingOperation() throws Exception {
+        final int parallelism = 2;
+
+        final JobVertex jobVertex = new JobVertex("Testing job vertex");
+        jobVertex.setInvokableClass(NoOpInvokable.class);
+        jobVertex.setParallelism(parallelism);
+        final JobGraph jobGraph = new JobGraph(jobVertex);
+        jobGraph.setAllowQueuedScheduling(true);
+        jobGraph.setScheduleMode(ScheduleMode.EAGER);
+
+        final ProgrammedSlotProvider slotProvider = new ProgrammedSlotProvider(2);
--- End diff --

Good catch. Will fix it.




[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417124#comment-16417124
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/5775#discussion_r177699910
  
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java ---
@@ -659,7 +660,7 @@ public void disconnectResourceManager() {
             .orTimeout(pendingRequest.getAllocatedSlotFuture(), allocationTimeout.toMilliseconds(), TimeUnit.MILLISECONDS)
             .whenCompleteAsync(
                 (AllocatedSlot ignored, Throwable throwable) -> {
-                    if (throwable != null) {
+                    if (throwable instanceof TimeoutException) {
--- End diff --

This callback is only intended to react to timeouts of the slot allocation 
future. Since we return the future to the user, any other exception should be 
visible.
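
As a rough sketch of this reasoning (not the actual `SlotPool` code): a callback that acts only on `TimeoutException` can be registered on the future, while the same future is handed back to the caller, so any other failure is still observable there. The class `TimeoutOnlyCallback`, the method `withTimeoutHandling`, and the counter `timeoutsHandled` are hypothetical names used only for illustration, and the timeout is simulated by completing the future exceptionally by hand.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class TimeoutOnlyCallback {

    /** Counts how often the (hypothetical) timeout handling ran. */
    static final AtomicInteger timeoutsHandled = new AtomicInteger();

    /**
     * Registers a callback that reacts only to timeouts and returns the same
     * future, so every other failure stays visible to the caller.
     */
    static CompletableFuture<String> withTimeoutHandling(CompletableFuture<String> slotFuture) {
        slotFuture.whenComplete((String ignored, Throwable throwable) -> {
            if (throwable instanceof TimeoutException) {
                // Stand-in for cleaning up the timed-out pending request.
                timeoutsHandled.incrementAndGet();
            }
        });
        return slotFuture;
    }

    public static void main(String[] args) {
        CompletableFuture<String> timedOut = withTimeoutHandling(new CompletableFuture<>());
        timedOut.completeExceptionally(new TimeoutException("allocation timed out"));

        CompletableFuture<String> failed = withTimeoutHandling(new CompletableFuture<>());
        failed.completeExceptionally(new IllegalStateException("slot lost"));

        System.out.println("timeouts handled: " + timeoutsHandled.get());
        System.out.println("other failure visible: " + failed.isCompletedExceptionally());
    }
}
```

The non-timeout failure is not swallowed: the callback simply ignores it, and the caller still sees it on the returned future.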




[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416128#comment-16416128
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5775#discussion_r177544954
  
--- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
@@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
         assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
     }
 
+    @Nonnull
+    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
--- End diff --

nit: make static and move below:
```
//
//  Utilities
//
```




[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416126#comment-16416126
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5775#discussion_r177544459
  
--- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
@@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
         assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
     }
 
+    @Nonnull
+    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
+        return new TestingLogicalSlot(
+            new LocalTaskManagerLocation(),
+            new SimpleAckingTaskManagerGateway(),
+            0,
+            new AllocationID(),
+            new SlotRequestId(),
+            new SlotSharingGroupId(),
+            releaseFuture);
+    }
+
+    /**
+     * Tests that a partially completed eager scheduling operation fails if an
--- End diff --

nit: *[...] if an completed [...]*




[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416124#comment-16416124
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5775#discussion_r177544274
  
--- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
@@ -465,6 +464,58 @@ public void testSchedulingOperationCancellationWhenCancel() throws Exception {
         assertThat(executionGraph.getTerminationFuture().get(), is(JobStatus.CANCELED));
     }
 
+    @Nonnull
+    private TestingLogicalSlot createTestingSlot(@Nullable CompletableFuture releaseFuture) {
+        return new TestingLogicalSlot(
+            new LocalTaskManagerLocation(),
+            new SimpleAckingTaskManagerGateway(),
+            0,
+            new AllocationID(),
+            new SlotRequestId(),
+            new SlotSharingGroupId(),
+            releaseFuture);
+    }
+
+    /**
+     * Tests that a partially completed eager scheduling operation fails if an
+     * completed slot is released. See FLINK-9099.
+     */
+    @Test
+    public void testSlotReleasingFailsSchedulingOperation() throws Exception {
+        final int parallelism = 2;
+
+        final JobVertex jobVertex = new JobVertex("Testing job vertex");
+        jobVertex.setInvokableClass(NoOpInvokable.class);
+        jobVertex.setParallelism(parallelism);
+        final JobGraph jobGraph = new JobGraph(jobVertex);
+        jobGraph.setAllowQueuedScheduling(true);
+        jobGraph.setScheduleMode(ScheduleMode.EAGER);
+
+        final ProgrammedSlotProvider slotProvider = new ProgrammedSlotProvider(2);
--- End diff --

Replace `2` with `parallelism`?




[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416123#comment-16416123
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5775#discussion_r177543886
  
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java ---
@@ -659,7 +660,7 @@ public void disconnectResourceManager() {
             .orTimeout(pendingRequest.getAllocatedSlotFuture(), allocationTimeout.toMilliseconds(), TimeUnit.MILLISECONDS)
             .whenCompleteAsync(
                 (AllocatedSlot ignored, Throwable throwable) -> {
-                    if (throwable != null) {
+                    if (throwable instanceof TimeoutException) {
--- End diff --

Are we not losing some information by swallowing other types of exceptions?




[jira] [Commented] (FLINK-9099) Failing allocated slots not noticed

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415640#comment-16415640
 ] 

ASF GitHub Bot commented on FLINK-9099:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/5775

[FLINK-9099] Assign Execution to LogicalSlot when slot is assigned to Execution

## What is the purpose of the change

In order to fail fast if an allocated slot is released by the SlotPool, we assign the
Execution as payload to a LogicalSlot when the slot is assigned to the Execution.

cc @GJL 

## Verifying this change

- Added `ExecutionGraphSchedulingTest#testSlotReleasingFailsSchedulingOperation`

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink fixSchedulingDeadlock

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5775





